LGMay 6
Feature Starvation as Geometric Instability in Sparse AutoencodersFaris Chaudhry, Keisuke Yano, Anthea Monod
Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
OCApr 18
Trajectory-Restricted Optimization Conditions and Geometry-Aware Linear ConvergenceFaris Chaudhry, Anthea Monod, Keisuke Yano
Linear convergence of first-order methods is typically characterized by global optimization conditions whose constants reflect worst-case geometry of the ambient space. In high-dimensional or structured problems, these global constants can be arbitrarily conservative and fail to capture the geometry actually encountered by optimization trajectories. In this paper, we develop a trajectory-restricted framework for linear convergence based on localized geometric regularity. We introduce restricted variants of the Polyak--Łojasiewicz inequality, error bound, and quadratic growth conditions that are required to hold only on subsets of the domain. We show that classical convergence guarantees extend under these localized conditions, and in key cases, we develop new arguments that yield explicit relationships between the corresponding constants. The resulting rates are governed by geometric quantities associated with the regions traversed by the algorithm. For polyhedral composite problems, we prove that convergence is controlled by restricted Hoffman constants corresponding to the active polyhedral faces visited along the trajectory. Once the iterates enter a well-conditioned face, the effective condition number improves accordingly. Our work provides a geometric quantification for fast local convergence after active-set or manifold identification and more broadly suggests that linear convergence is fundamentally governed by the geometry of the subsets explored by the algorithm, rather than by worst-case global conditioning.
LGMay 16
The Geometry of Projection Heads: Conditioning, Invariance, and CollapseFaris Chaudhry
We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.
LGMar 11
Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-ContextFaris Chaudhry, Siddhant Gadkari
In-context learning (ICL) allows Transformers to adapt to novel tasks without weight updates, yet the underlying algorithms remain poorly understood. We adopt a statistical decision-theoretic perspective by investigating simple binary hypothesis testing, where the optimal policy is determined by the likelihood-ratio test. Notably, this setup provides a mathematically rigorous setting for mechanistic interpretability where the target algorithmic ground truth is known. By training Transformers on tasks requiring distinct geometries (linear shifted means vs. nonlinear variance estimation), we demonstrate that the models approximate the Bayes-optimal sufficient statistics from context up to some monotonic transformation, matching the performance of an ideal oracle estimator in nonlinear regimes. Leveraging this analytical ground truth, mechanistic analysis via logit lens and circuit alignment suggests that the model does not rely on a fixed kernel smoothing heuristic. Instead, it appears to adapt the point at which decisions become linearly decodable: exhibiting patterns consistent with a voting-style ensemble for linear tasks while utilizing a deeper sequential computation for nonlinear tasks. These findings suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching.
LGMar 13
Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCEFaris Chaudhry
The InfoNCE loss in contrastive learning depends critically on a temperature parameter, yet its dynamics under fixed versus annealed schedules remain poorly understood. We provide a theoretical analysis by modeling embedding evolution under Langevin dynamics on a compact Riemannian manifold. Under mild smoothness and energy-barrier assumptions, we show that classical simulated annealing guarantees extend to this setting: slow logarithmic inverse-temperature schedules ensure convergence in probability to a set of globally optimal representations, while faster schedules risk becoming trapped in suboptimal minima. Our results establish a link between contrastive learning and simulated annealing, providing a principled basis for understanding and tuning temperature schedules.
LGMar 13
Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE NonlinearityFaris Chaudhry
We establish empirical scaling laws for Single-Layer Physics-Informed Neural Networks on canonical nonlinear PDEs. We identify a dual optimization failure: (i) a baseline pathology, where the solution error fails to decrease with network width, even at fixed nonlinearity, falling short of theoretical approximation bounds, and (ii) a compounding pathology, where this failure is exacerbated by nonlinearity. We provide quantitative evidence that a simple separable power law is insufficient, and that the scaling behavior is governed by a more complex, non-separable relationship. This failure is consistent with the concept of spectral bias, where networks struggle to learn the high-frequency solution components that intensify with nonlinearity. We show that optimization, not approximation capacity, is the primary bottleneck, and propose a methodology to empirically measure these complex scaling effects.