LG, Oct 11, 2023

Measuring Feature Sparsity in Language Models

arXiv:2310.07837v2, 3 citations, h-index: 9
Originality: Incremental advance
AI Analysis

This work provides tools for evaluating sparse coding techniques in language models, an incremental advance for researchers in interpretability and model analysis.

The paper assesses whether the linearity and sparsity assumptions hold for language model activations by developing metrics that measure feature sparsity. It finds that activations can be accurately modeled as sparse linear combinations of features, significantly more so than control datasets, and that activations appear to be sparsest in the first and final layers.

Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
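
To make the sparse linear assumption concrete, the sketch below generates synthetic activations of the kind the abstract describes: each activation vector is a linear combination of a small number of feature directions. This is only an illustration, not the paper's code or its metrics. The function name, the random unit-norm dictionary, the coefficient range, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def synthetic_sparse_activations(n_samples, dim, n_features, k, seed=0):
    """Sample activations that are sums of exactly k feature directions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Assumed feature dictionary: random directions scaled to unit norm.
    features = rng.normal(size=(n_features, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    # Sparse coefficients: exactly k nonzero entries per sample.
    coeffs = np.zeros((n_samples, n_features))
    for i in range(n_samples):
        active = rng.choice(n_features, size=k, replace=False)
        coeffs[i, active] = rng.uniform(0.5, 1.5, size=k)
    # Linear combination of the active feature directions.
    activations = coeffs @ features
    return activations, features, coeffs

acts, feats, coeffs = synthetic_sparse_activations(
    n_samples=1000, dim=64, n_features=256, k=5
)
# Ground-truth sparsity: average number of active features per sample.
avg_active = np.count_nonzero(coeffs, axis=1).mean()
print(acts.shape, avg_active)
```

Here the true sparsity level (`k`) is known by construction, which is what makes data like this useful for checking whether a sparsity metric recovers the right value before applying it to real model activations.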
