Nilesh Sarkar

11.3 LG May 9

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

Nilesh Sarkar, Dawar Jyoti Deka

Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show that it can be estimated via an SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and a participation-ratio lower bound kappa_PR of approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths, kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.
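
The participation-ratio bound mentioned in the abstract can be illustrated with a small sketch. It assumes the standard definition PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), applied to the empirical average of outer products g g^T; the abstract does not spell out the exact estimator, so the function name `participation_ratio` and the toy inputs are hypothetical. The identity tr(C^2) = (1/n^2) * sum over i, j of (g_i . g_j)^2 avoids an explicit eigendecomposition.

```python
def participation_ratio(vectors):
    """Participation ratio tr(C)^2 / tr(C^2) for C = mean of outer products g g^T.

    Assumption: this is the textbook participation ratio, not necessarily
    the exact estimator used in the paper. It never exceeds rank(C).
    """
    n = len(vectors)
    # Gram matrix of pairwise dot products g_i . g_j
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    # tr(C) is the mean squared norm of the vectors
    trace_c = sum(gram[i][i] for i in range(n)) / n
    # tr(C^2) is the mean of squared pairwise dot products
    trace_c2 = sum(gram[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    return trace_c ** 2 / trace_c2


# Toy check: three orthonormal directions in a 5-dimensional space
basis = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]
pr_spread = participation_ratio(basis)            # close to 3: variance spread evenly
pr_single = participation_ratio([basis[0]] * 4)   # close to 1: a single direction
```

Because the participation ratio discounts directions that carry little variance, it can sit well below the rank of the same matrix, which is why it serves only as a lower bound on a rank-like quantity.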

10.8 LG Apr 5

Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

Dawar Jyoti Deka, Nilesh Sarkar

Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/((1-\alpha)\ln\frac{1}{1-\alpha})$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
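
The capacity bound can be evaluated numerically. The sketch below assumes the critical width is obtained by solving $d_S \cdot g(\alpha) = F$ for $d_S$, which is an inference from the abstract rather than a stated formula; the function names `capacity` and `critical_width` are hypothetical.

```python
import math


def capacity(alpha):
    """Capacity function g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha)))."""
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))


def critical_width(num_features, alpha):
    """Smallest width d_S with d_S * g(alpha) >= num_features.

    Assumption: the critical width is F / g(alpha); the abstract only
    states the capacity bound, not this rearrangement.
    """
    return num_features / capacity(alpha)


# Rounded values quoted in the abstract for Pythia-410M
g = capacity(0.992)
d_star = critical_width(28_700, 0.992)
print(f"g(alpha) = {g:.2f}, critical width = {d_star:.0f}")
```

With these rounded inputs the result lands near 1,100, the same order as the reported $d_S^* \approx 1{,}065$. The estimate is sensitive to the third decimal of $\alpha$, so the gap is plausibly a rounding effect in the quoted figures.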

2 Papers