Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
This work addresses a practical interpretability challenge for researchers analyzing SAEs in LLMs, though its contribution is incremental because it builds on existing visualization methods.
The paper addresses the difficulty of exploring the very large number of features that sparse autoencoders (SAEs) extract from large language models, and proposes a focused visualization framework that prioritizes curated concepts to enable targeted analysis of feature relationships.
Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.
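To make the idea of a focused, hybrid view concrete, the following is a minimal sketch and not the system described above. It assumes the SAE decoder is available as a matrix with one row per feature, uses a random matrix as a stand-in for real weights, and uses hypothetical feature ids for one curated concept. The abstract does not name a specific topological method or projection, so the sketch substitutes a minimum spanning tree over cosine distances (whose edges correspond to single-linkage merges) as a simple topology-style summary, and PCA as the dimensionality reduction. All function names are illustrative.

```python
import numpy as np

def select_directions(decoder, feature_ids):
    """Pick curated rows of a (n_features, d_model) decoder matrix and unit-normalize them."""
    sub = decoder[np.asarray(feature_ids)]
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

def cosine_distance_matrix(unit_dirs):
    """Pairwise cosine distance (1 - cosine similarity) between unit vectors."""
    sim = np.clip(unit_dirs @ unit_dirs.T, -1.0, 1.0)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    return dist

def pca_2d(unit_dirs):
    """Project the selected directions onto their top two principal components."""
    centered = unit_dirs - unit_dirs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns (i, j, weight) tree edges."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()            # cheapest known link from the tree to each node
    parent = np.zeros(n, dtype=int)  # tree node that provides that cheapest link
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        edges.append((int(parent[j]), j, float(candidates[j])))
        in_tree[j] = True
        update = (dist[j] < best) & ~in_tree
        best[update] = dist[j][update]
        parent[update] = j
    return edges

# Stand-in data: random weights instead of a trained SAE (assumption).
rng = np.random.default_rng(0)
decoder = rng.normal(size=(1000, 64))
curated = [3, 17, 256, 511, 900]     # hypothetical feature ids for one concept

dirs = select_directions(decoder, curated)
coords = pca_2d(dirs)                                # 2D layout for plotting
tree = mst_edges(cosine_distance_matrix(dirs))       # edges to draw over the layout
```

Drawing the tree edges on top of the 2D coordinates keeps nearest-neighbor relationships from the original space visible even where the projection distorts them, which is the kind of local and global pairing the abstract describes. Restricting the input to a curated subset is what keeps both computations cheap and the plot readable.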