Shu Ge

h-index7

4papers

17citations

Novelty64%

AI Score44

Ranked #75,770 of 201,326 authors (top 38%)#17,139 in LG (top 40%)

4 Papers

LGNov 16, 2023

A Knowledge Distillation Approach for Sepsis Outcome Prediction from Multivariate Clinical Time Series

Anna Wong, Shu Ge, Nassim Oufattole et al.

Sepsis is a life-threatening condition triggered by an extreme infection response. Our objective is to forecast sepsis patient outcomes using their medical history and treatments, while learning interpretable state representations to assess patients' risks in developing various adverse outcomes. While neural networks excel in outcome prediction, their limited interpretability remains a key issue. In this work, we use knowledge distillation via constrained variational inference to distill the knowledge of a powerful "teacher" neural network model with high predictive power to train a "student" latent variable model to learn interpretable hidden state representations to achieve high predictive performance for sepsis outcome prediction. Using real-world data from the MIMIC-IV database, we trained an LSTM as the "teacher" model to predict mortality for sepsis patients, given information about their recent history of vital signs, lab values and treatments. For our student model, we use an autoregressive hidden Markov model (AR-HMM) to learn interpretable hidden states from patients' clinical time series, and use the posterior distribution of the learned state representations to predict various downstream outcomes, including hospital mortality, pulmonary edema, need for diuretics, dialysis, and mechanical ventilation. Our results show that our approach successfully incorporates the constraint to achieve high predictive power similar to the teacher model, while maintaining the generative performance.

LGFeb 21

Boosting for Vector-Valued Prediction and Conditional Density Estimation

Jian Qian, Shu Ge

Despite the widespread use of boosting in structured prediction, a general theoretical understanding of aggregation beyond scalar losses remains incomplete. We study vector-valued and conditional density prediction under general divergences and identify stability conditions under which aggregation amplifies weak guarantees into strong ones. We formalize this stability property as \emph{$(α,β)$-boostability}. We show that geometric median aggregation achieves $(α,β)$-boostability for a broad class of divergences, with tradeoffs that depend on the underlying geometry. For vector-valued prediction and conditional density estimation, we characterize boostability under common divergences ($\ell_1$, $\ell_2$, $\TV$, and $\Hel$) with geometric median, revealing a sharp distinction between dimension-dependent and dimension-free regimes. We further show that while KL divergence is not directly boostable via geometric median aggregation, it can be handled indirectly through boostability under Hellinger distance. Building on these structural results, we propose a generic boosting framework \textsc{GeoMedBoost} based on exponential reweighting and geometric-median aggregation. Under a weak learner condition and $(α,β)$-boostability, we obtain exponential decay of the empirical divergence exceedance error. Our framework recovers classical algorithms such as \textsc{MedBoost}, \textsc{AdaBoost}, and \textsc{SAMME} as special cases, and provides a unified geometric view of boosting for structured prediction.

LGOct 24, 2025

Normalization in Attention Dynamics

Nikita Karagodin, Shu Ge, Yury Polyanskiy et al.

We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes -- including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT -- revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.

LGMay 1, 2023

Model-agnostic Measure of Generalization Difficulty

Akhilan Boopathy, Kevin Liu, Jaedong Hwang et al.

The measure of a machine learning algorithm is the difficulty of the tasks it can perform, and sufficiently difficult tasks are critical drivers of strong machine learning models. However, quantifying the generalization difficulty of machine learning benchmarks has remained challenging. We propose what is to our knowledge the first model-agnostic measure of the inherent generalization difficulty of tasks. Our inductive bias complexity measure quantifies the total information required to generalize well on a task minus the information provided by the data. It does so by measuring the fractional volume occupied by hypotheses that generalize on a task given that they fit the training data. It scales exponentially with the intrinsic dimensionality of the space over which the model must generalize but only polynomially in resolution per dimension, showing that tasks which require generalizing over many dimensions are drastically more difficult than tasks involving more detail in fewer dimensions. Our measure can be applied to compute and compare supervised learning, reinforcement learning and meta-learning generalization difficulties against each other. We show that applied empirically, it formally quantifies intuitively expected trends, e.g. that in terms of required inductive bias, MNIST < CIFAR10 < Imagenet and fully observable Markov decision processes (MDPs) < partially observable MDPs. Further, we show that classification of complex images < few-shot meta-learning with simple images. Our measure provides a quantitative metric to guide the construction of more complex tasks requiring greater inductive bias, and thereby encourages the development of more sophisticated architectures and learning algorithms with more powerful generalization capabilities.