CL | AI | LG | Oct 30, 2025

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

arXiv:2511.05541v1 | 7 citations | h-index: 31
Originality: Incremental advance
AI Analysis

This work targets researchers and practitioners working on language model interpretability. It improves feature discovery, but the advance is incremental: it keeps the existing SAE framework and adds a novel loss term.

The paper addresses a known weakness of sparse autoencoders (SAEs): they tend to capture shallow, token-specific features rather than meaningful concepts in language models. It proposes Temporal Sparse Autoencoders (T-SAEs), which add a contrastive loss that encourages consistent feature activations over adjacent tokens. Across multiple datasets and models, T-SAEs yield smoother, more coherent semantic concepts without sacrificing reconstruction quality.

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
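The abstract describes the temporal contrastive loss only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes an InfoNCE-style objective in which the high-level code of a token is pulled towards the code of the token just before it in the same sequence, while codes from other sequences in the batch act as negatives. The function names, the `n_high` split between high-level and remaining features, the `temperature`, and the weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """ReLU encoder: (B, d_model) activations -> (B, n_features) sparse codes."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def temporal_contrastive_loss(z_cur, z_prev, temperature=0.1):
    """InfoNCE-style loss over adjacent-token pairs (an assumed form).

    z_cur[i] and z_prev[i] are high-level codes for token t and token t-1
    of the same sequence i. Rows from other sequences serve as negatives.
    """
    eps = 1e-8  # guards against all-zero codes after the ReLU
    a = z_cur / (np.linalg.norm(z_cur, axis=1, keepdims=True) + eps)
    b = z_prev / (np.linalg.norm(z_prev, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature                        # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_probs))

def tsae_loss(x_cur, x_prev, W_enc, b_enc, W_dec, b_dec, n_high, alpha=1.0):
    """Reconstruction loss plus a weighted temporal term (hypothetical weighting)."""
    z_cur = encode(x_cur, W_enc, b_enc)
    z_prev = encode(x_prev, W_enc, b_enc)
    # Reconstruct both tokens from their full codes.
    recon = 0.5 * (
        np.mean(np.sum((x_cur - (z_cur @ W_dec + b_dec)) ** 2, axis=1))
        + np.mean(np.sum((x_prev - (z_prev @ W_dec + b_dec)) ** 2, axis=1))
    )
    # Only the first n_high features are asked to be temporally consistent.
    contrast = temporal_contrastive_loss(z_cur[:, :n_high], z_prev[:, :n_high])
    return recon + alpha * contrast, recon, contrast
```

Applying the temporal term to only a subset of features is one way to reflect the abstract's distinction between smooth semantic content and local syntactic information: the remaining features stay free to track token-level detail.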

