CL, LG | Jul 8, 2025

Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

arXiv:2507.06427v1 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the low trustworthiness and limited downstream performance of LLMs in NLP applications, offering an incremental improvement through interpretable decomposition.

The paper tackles the black-box nature of Large Language Models by using sparse autoencoders to extract monosemantic features. These features expose model-internal misunderstandings, which are then used to reformulate prompts automatically with additional annotations, yielding significant performance gains on tasks such as mathematical reasoning and metaphor detection.

Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of prompts with additional annotations to improve their interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
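
To make the dictionary-learning idea in the abstract concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss. It is not the paper's implementation: the layer sizes, the L1 coefficient, the random stand-in activations, and all names (`init_sae`, `sae_forward`, `sae_loss`) are assumptions chosen for illustration. It uses the common formulation of a ReLU encoder over an overcomplete dictionary, a linear decoder, and an L1 penalty that pushes most feature activations to zero.

```python
import numpy as np


def init_sae(d_model, d_dict, seed=0):
    """Randomly initialise a sparse autoencoder (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_model, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, 0.1, size=(d_dict, d_model)),
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre)  # ReLU keeps feature activations >= 0
    x_hat = features @ params["W_dec"] + params["b_dec"]
    return features, x_hat


def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon + l1_coeff * sparsity


# Random vectors stand in for LLM activations (assumption: 16-dim, batch of 4).
rng = np.random.default_rng(1)
activations = rng.normal(size=(4, 16))

params = init_sae(d_model=16, d_dict=64)  # dictionary is 4x overcomplete
features, reconstruction = sae_forward(params, activations)
loss = sae_loss(activations, features, reconstruction)
```

In practice the parameters would be trained by gradient descent on this loss over activations collected from a real model; after training, each dictionary column is a candidate monosemantic feature direction.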

