CL · May 12

Mechanistic Interpretability of ASR models using Sparse Autoencoders

Dan Pluth, Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani

arXiv:2605.12225 · 77.8

Predicted impact: top 75% in CL · last 90 days
Originality: Synthesis-oriented

AI Analysis

For researchers in mechanistic interpretability and ASR, this paper extends SAE-based interpretability from text LLMs to audio models, though it is an incremental application of an existing method to a new domain.

This work applies Sparse Autoencoders (SAEs) to interpret Whisper, a Transformer-based ASR model, by training a sparse latent space on encoder embeddings. It uncovers monosemantic features across linguistic and non-linguistic boundaries and demonstrates cross-lingual feature steering, establishing the viability of SAEs for audio models.

Understanding the inner workings of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in domains that affect the public at large, such as industry, academia, finance, and health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAEs) have emerged to understand these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of SAEs in interpreting text-based Large Language Models (LLMs), there are no equivalent studies that demonstrate the application of an SAE to audio processing models such as Automatic Speech Recognizers (ASRs). In this work, an SAE is applied to Whisper, a Transformer-based ASR, by training a high-dimensional sparse latent space on frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of SAEs for interpreting ASR models and demonstrates that Whisper encodes a rich amount of linguistic information.
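To make the abstract's description concrete, the following is a minimal sketch of the general SAE idea: a dense frame-level embedding is projected into a wider, non-negative sparse latent vector, decoded back, and scored with a reconstruction term plus a sparsity penalty. It is not the authors' implementation. The embedding width, latent size, ReLU encoder, L1 penalty, and the `steer` helper are all assumptions chosen for illustration, and random vectors stand in for real Whisper encoder outputs. Only the forward pass is shown; no training loop is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from the paper).
d_model = 384            # width of one frame-level embedding
d_latent = 4 * d_model   # overcomplete sparse latent space
n_frames = 16

# Random stand-in for frame-level embeddings from an ASR encoder.
frames = rng.standard_normal((n_frames, d_model))

# Untrained SAE parameters (a real SAE would learn these).
W_enc = rng.standard_normal((d_model, d_latent)) * 0.02
b_enc = np.zeros(d_latent)
W_dec = rng.standard_normal((d_latent, d_model)) * 0.02
b_dec = np.zeros(d_model)


def encode(x):
    """Project dense embeddings into non-negative latent activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(z):
    """Reconstruct dense embeddings from latent activations."""
    return z @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    z = encode(x)
    x_hat = decode(z)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity


def steer(x, feature_idx, strength):
    """Hypothetical steering helper: add one feature's decoder
    direction, scaled by `strength`, to every frame embedding."""
    return x + strength * W_dec[feature_idx]


latents = encode(frames)
loss = sae_loss(frames)
steered = steer(frames, feature_idx=7, strength=5.0)
```

In this sketch, steering simply shifts each embedding along the decoder row of a chosen latent feature. How the paper selects features and applies steering across languages is not specified in the abstract, so that part should be read as one plausible mechanism rather than the reported method.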
