Entropy-Lens: The Information Signature of Transformer Computations
This work provides a model-agnostic interpretability tool for transformer models that offers insight into computation patterns and task performance. The contribution is incremental: it builds on existing interpretability methods and adds a focus on entropy-based signatures.
The authors address the problem of interpreting transformer models by analyzing token-level distributions in vocabulary space. They introduce Entropy-Lens, which computes entropy profiles that reveal computation patterns, predict prompt type, and correlate with output correctness across a range of transformers, without modifying the models.
Transformer models map input token sequences to output token distributions, layer by layer. While most interpretability work focuses on internal latent representations, we study the evolution of these token-level distributions directly in vocabulary space. However, such distributions are high-dimensional and defined on an unordered support, making common descriptors like moments or cumulants ill-suited. We address this by computing the Shannon entropy of each intermediate predicted distribution, yielding one interpretable scalar per layer. The resulting sequence, the entropy profile, serves as a compact, information-theoretic signature of the model's computation. We introduce Entropy-Lens, a model-agnostic framework that extracts entropy profiles from frozen, off-the-shelf transformers. We show that these profiles (i) reveal family-specific computation patterns invariant under depth rescaling, (ii) are predictive of prompt type and task format, and (iii) correlate with output correctness. We further show that Rényi entropies yield similar results within a broad range of $\alpha$ values, justifying the use of Shannon entropy as a stable and principled summary. Our results hold across different transformers, without requiring gradients, fine-tuning, or access to model internals.
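As an illustration of the quantity described above, the following minimal sketch computes an entropy profile from a list of per-layer logit vectors, together with a Rényi entropy for comparison. It is not the authors' implementation. It assumes that the intermediate logits over the vocabulary are already available; how they are obtained is not specified here. The function names (`softmax`, `shannon_entropy`, `renyi_entropy`, `entropy_profile`) and the toy 4-token vocabulary are hypothetical and chosen only for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def renyi_entropy(probs: Sequence[float], alpha: float) -> float:
    """Renyi entropy of order alpha > 0 in nats; alpha = 1 is the Shannon limit."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        return shannon_entropy(probs)
    power_sum = sum(p ** alpha for p in probs if p > 0.0)
    return math.log(power_sum) / (1.0 - alpha)


def entropy_profile(layer_logits: Sequence[Sequence[float]]) -> List[float]:
    """One Shannon entropy per layer: the entropy profile of a single token."""
    return [shannon_entropy(softmax(logits)) for logits in layer_logits]


# Toy example (assumed data): three "layers" over a 4-token vocabulary,
# with the prediction sharpening toward the first token at each layer.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
]
profile = entropy_profile(toy_logits)
print(profile)
```

In this toy case the first entry equals $\log 4$, the entropy of the uniform distribution over four tokens, and the values decrease as the distribution concentrates on a single token. Calling `renyi_entropy` with different `alpha` values on the same distributions gives a rough way to check how sensitive the profile is to the choice of entropy.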