LG · AI · Jun 9, 2025

InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

arXiv:2506.07406v2 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses interpretability for researchers and practitioners working with large language models, offering an incremental improvement over existing methods.

The paper tackles the challenge of interpreting internal representations in large language models by introducing InverseScope, a scalable framework for activation inversion. It improves sample efficiency and supports systematic analysis, using a feature consistency rate as the metric for quantitative evaluation.

Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded information. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.
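
The inversion idea in the abstract can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: `toy_activation` is a hypothetical stand-in for a model's hidden state, the inputs are two-word sentences that can be enumerated exhaustively (the paper instead proposes a conditional generator to sample them), and the feature consistency rate is taken here to be the fraction of recovered inputs that share a given feature with the original input.

```python
import itertools
import math

# Hypothetical toy input space: (subject, verb) pairs.
SUBJECTS = ["cat", "dog"]
VERBS = ["runs", "sleeps"]


def toy_activation(sentence):
    """Stand-in for a hidden activation (assumption, not a real model).

    The subject is encoded strongly (+/-1.0) and the verb weakly (+/-0.1).
    """
    subject, verb = sentence
    return (
        1.0 if subject == "cat" else -1.0,
        0.1 if verb == "runs" else -0.1,
    )


def inverse_set(target, epsilon):
    """Return all inputs whose activation lies within epsilon of the target's.

    Exhaustive enumeration only works because the toy space is tiny.
    """
    z = toy_activation(target)
    candidates = itertools.product(SUBJECTS, VERBS)
    return [x for x in candidates if math.dist(toy_activation(x), z) <= epsilon]


def feature_consistency_rate(samples, target, feature):
    """Fraction of samples whose feature value matches the target's."""
    matches = sum(feature(x) == feature(target) for x in samples)
    return matches / len(samples)


target = ("cat", "runs")
samples = inverse_set(target, epsilon=0.5)

# Two competing hypotheses about what the activation encodes.
subject_rate = feature_consistency_rate(samples, target, lambda x: x[0])
verb_rate = feature_consistency_rate(samples, target, lambda x: x[1])

print(samples)
print("subject consistency:", subject_rate)
print("verb consistency:", verb_rate)
```

In this toy setting, every recovered input keeps the subject while the verb varies, so the subject hypothesis gets a high consistency rate and the verb hypothesis a lower one. That is the kind of quantitative comparison between hypotheses the abstract describes, here at a scale where no learned sampler is needed.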

