CL, LG · Jul 8, 2025

Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin

arXiv:2507.06427v1 · 4.91 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the limited trustworthiness and task performance of LLMs in NLP applications, offering an incremental improvement through interpretable decomposition.

The paper tackles the black-box nature of Large Language Models by using sparse autoencoders to extract monosemantic features. These features are used to identify model-internal misunderstandings and to reformulate prompts with additional annotations, which the authors report yields significant performance gains on tasks such as mathematical reasoning and metaphor detection.

Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces their trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of prompts with additional annotations to improve their interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
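The abstract does not give the architecture or training details, so the following is only a minimal sketch of the general sparse-autoencoder idea behind dictionary learning, not the authors' implementation. It assumes a single ReLU encoder layer, a linear decoder, and an L1 sparsity penalty; the class name `SparseAutoencoder`, the helper `top_features`, the dimensions, and the coefficient `l1_coeff` are all hypothetical choices made for illustration. The weights are random and untrained, so the output shows only the mechanics.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class SparseAutoencoder:
    """Overcomplete autoencoder: d_model activations -> d_dict features -> d_model.

    Hypothetical illustration; weights are random, not trained.
    """

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project into the dictionary, apply ReLU.
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.W_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        # Reconstruct the activation as a linear combination of dictionary directions.
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Squared reconstruction error plus an L1 penalty that encourages sparse features.
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(abs(v) for v in f)
        return recon + l1_coeff * sparsity


def top_features(sae, x, k=3):
    """Return up to k (index, activation) pairs for the most active features."""
    f = sae.encode(x)
    ranked = sorted(range(len(f)), key=lambda i: f[i], reverse=True)
    return [(i, f[i]) for i in ranked[:k] if f[i] > 0]


# Toy "activation" vector standing in for an LLM hidden state (made-up values).
sae = SparseAutoencoder(d_model=4, d_dict=16)
x = [0.5, -1.0, 0.25, 2.0]
print(top_features(sae, x))
print(sae.loss(x))
```

In a trained model of this kind, the most active feature indices for a given input are what one would inspect to judge how the model interprets that input.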
