Sparse Crosscoders for diffing MoEs and Dense models
This work addresses the interpretability gap for MoE models, a relevant problem for machine learning researchers, but the contribution is incremental because it builds on existing sparse autoencoder methods.
The researchers use crosscoders to compare the internal representations of Mixture of Experts (MoE) models with those of dense models, achieving ~87% fractional variance explained and finding that the MoE learns fewer unique features and more specialized representations.
Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly model multiple activation spaces. We train 5-layer dense and MoE models (with equal active parameters) on 1B tokens across code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve $\sim 87\%$ fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features than the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations, while dense models distribute information across broader, more general-purpose features.
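To make the quantities named in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation, and all function names are hypothetical. It assumes that BatchTopK keeps the k × batch_size largest positive latent activations across the whole batch (rather than k per sample), that fractional variance explained is one minus the residual sum of squares divided by the total variance around the per-dimension mean, and that activation density is the fraction of inputs on which a latent is nonzero.

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest positive activations in the batch.

    acts: list of rows, one row of latent activations per input.
    Assumption: the budget is shared across the batch, so one input may
    keep more than k latents while another keeps fewer.
    """
    budget = k * len(acts)
    # Collect every positive activation with its (row, column) position.
    positions = [
        (value, i, j)
        for i, row in enumerate(acts)
        for j, value in enumerate(row)
        if value > 0
    ]
    positions.sort(key=lambda t: t[0], reverse=True)

    sparse = [[0.0] * len(row) for row in acts]
    for value, i, j in positions[:budget]:
        sparse[i][j] = value
    return sparse


def fraction_variance_explained(x, x_hat):
    """1 - (residual sum of squares) / (total variance around the mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    residual = sum(
        (a - b) ** 2 for rx, rh in zip(x, x_hat) for a, b in zip(rx, rh)
    )
    total = sum((a - m) ** 2 for rx in x for a, m in zip(rx, mean))
    return 1.0 - residual / total


def activation_density(sparse_acts):
    """Per-latent fraction of inputs on which the latent is active."""
    n = len(sparse_acts)
    width = len(sparse_acts[0])
    return [sum(1 for row in sparse_acts if row[j] > 0) / n for j in range(width)]


# Toy batch: 2 inputs, 3 latents, k = 1 gives a shared budget of 2.
acts = [[0.9, 0.8, 0.0], [0.1, 0.2, 0.3]]
sparse = batch_topk(acts, k=1)
density = activation_density(sparse)
```

In this toy batch both retained activations come from the first input, which is the behavior that distinguishes a batch-level budget from a per-sample top-k. Under these assumptions, a reported value of about 87% corresponds to `fraction_variance_explained` returning roughly 0.87 on the models' activations.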