LG AI CL | Jun 24, 2025

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

arXiv:2506.20040v27.11 citations | h-index: 26

Originality: Incremental advance

AI Analysis

This work helps researchers and practitioners interpret language models by providing a method to uncover cross-layer concepts. It appears incremental, since it builds on existing VQ-VAE techniques with specific enhancements.

The paper tackles the challenge of interpreting emergent concepts across transformer layers by addressing the linear mixing and duplication of information in the residual stream, which obscures feature evolution. It proposes CLVQ-VAE, a framework that uses vector quantization to collapse duplicated features into compact, interpretable concept vectors, achieving improved interpretability as demonstrated through qualitative analysis.

Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers and, in the process, collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with semantic structure in word embedding space.
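
To make the quantization step in the abstract concrete, the following is a minimal NumPy sketch of top-k temperature-based sampling followed by an EMA codebook update. It is an illustration under stated assumptions, not the authors' implementation: the function names, the squared Euclidean distance, the softmax over negative distances, and the Laplace-smoothed EMA rule (as commonly used in VQ-VAE variants) are all choices made here for the example, and the paper's exact formulation may differ.

```python
import numpy as np


def topk_temperature_quantize(x, codebook, k=5, temperature=1.0, rng=None):
    """Sample one code index per input from its k nearest codes (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Squared Euclidean distance from every input to every code: shape (n, m).
    # Assumption: the paper may use a different similarity measure.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest codes per input (unordered within the k).
    topk = np.argpartition(d2, k - 1, axis=1)[:, :k]
    topk_d2 = np.take_along_axis(d2, topk, axis=1)
    # Softmax over negative distances; a lower temperature approaches argmin.
    logits = -topk_d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling of one candidate per row.
    u = rng.random((x.shape[0], 1))
    choice = (probs.cumsum(axis=1) < u).sum(axis=1)
    choice = np.minimum(choice, k - 1)  # guard against floating-point rounding
    return topk[np.arange(x.shape[0]), choice]


def ema_update(codebook, cluster_size, embed_sum, x, codes, decay=0.99, eps=1e-5):
    """Exponential-moving-average codebook update (hypothetical helper)."""
    m = codebook.shape[0]
    one_hot = np.eye(m)[codes]  # (n, m) assignment matrix
    # Running count of inputs assigned to each code.
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(axis=0)
    # Running sum of the inputs assigned to each code.
    embed_sum = decay * embed_sum + (1 - decay) * (one_hot.T @ x)
    # Laplace smoothing keeps rarely used codes from dividing by zero.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + m * eps) * n
    new_codebook = embed_sum / smoothed[:, None]
    return new_codebook, cluster_size, embed_sum
```

With `k=1` the sampler reduces to ordinary nearest-code assignment. Larger `k` or a higher `temperature` spreads assignments over more codes, which is one way to read the abstract's "controlled exploration of the discrete latent space".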
