LG · Oct 11, 2023

Measuring Feature Sparsity in Language Models

arXiv:2310.07837v2 · 6.63 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work provides tools for evaluating sparse coding techniques in language models, an incremental advance for researchers in interpretability and model analysis.

The paper assesses whether the linearity and sparsity assumptions hold for language model activations. It develops metrics to measure feature sparsity and finds that activations can be accurately modeled as sparse linear combinations of features, significantly more so than control datasets. Activations appear to be sparsest in the first and final layers.

Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
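
To make the sparse linear assumption concrete, the following is a minimal sketch of how synthetic sparse linear activations might be generated. It is an illustration only, not the paper's code or metrics: the function name `synthetic_sparse_activations`, the Gaussian unit-norm feature directions, and the uniform coefficient range are all assumptions chosen for the example.

```python
import numpy as np

def synthetic_sparse_activations(n_samples, dim, n_features, k, seed=0):
    """Hypothetical helper: each activation is a linear combination of
    exactly k feature directions drawn from a fixed dictionary."""
    rng = np.random.default_rng(seed)

    # Assumed dictionary: random Gaussian directions, normalised to unit length.
    features = rng.normal(size=(n_features, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)

    # Sparse coefficients: k nonzero entries per sample, strictly positive.
    coeffs = np.zeros((n_samples, n_features))
    for row in coeffs:
        active = rng.choice(n_features, size=k, replace=False)
        row[active] = rng.uniform(0.5, 1.5, size=k)

    # Activation = sparse linear combination of feature vectors.
    activations = coeffs @ features
    return activations, coeffs, features

acts, coeffs, feats = synthetic_sparse_activations(
    n_samples=100, dim=16, n_features=64, k=3
)

# Ground-truth sparsity: average number of active features per sample.
avg_active = np.count_nonzero(coeffs, axis=1).mean()
```

Here the ground-truth sparsity level is known by construction (`k` active features per sample), which is the kind of setting in which a sparsity metric's predictions can be checked before applying it to real model activations.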
