Atlas-Alignment: Making Interpretability Transferable Across Language Models
The work addresses the problem of making language models transparent and controllable for developers and researchers. It offers a scalable solution that amortizes interpretability costs, though the contribution is incremental in that it builds on existing alignment techniques.
The paper tackles the high cost and poor scalability of interpretability for language models by introducing Atlas-Alignment, a framework that transfers interpretability across models through lightweight alignment to a pre-labeled Concept Atlas. This enables semantic feature search and steerable generation without additional labeled data.
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific sparse autoencoders (SAEs), labeling the SAE components manually or semi-automatically, and then validating those labels. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas (a labeled, human-interpretable latent space) using only shared inputs and lightweight representational alignment techniques. Once a model is aligned, the framework enables two key capabilities in a previously opaque model: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
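To make the alignment idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not say which alignment method is used, so the sketch assumes a ridge-regularized linear map fitted on activations from shared inputs. All names (`fit_linear_alignment`, `top_concepts`, `steering_direction`), the synthetic data, and the concept labels are hypothetical.

```python
import numpy as np


def fit_linear_alignment(model_acts, atlas_acts, ridge=1e-3):
    """Fit W so that model_acts @ W approximates atlas_acts.

    Assumption: a ridge-regularized least-squares map; the source only says
    "lightweight representational alignment" without naming a method.
    model_acts: (n_inputs, d_model), atlas_acts: (n_inputs, d_atlas).
    """
    d_model = model_acts.shape[1]
    gram = model_acts.T @ model_acts + ridge * np.eye(d_model)
    return np.linalg.solve(gram, model_acts.T @ atlas_acts)


def top_concepts(activation, W, labels, k=3):
    """Semantic retrieval: map one model activation into atlas coordinates
    and return the k labels with the largest coordinates."""
    atlas_coords = activation @ W
    order = np.argsort(-atlas_coords)[:k]
    return [labels[i] for i in order]


def steering_direction(W, concept_index):
    """Steering: a unit vector in model space whose image under W points
    along one atlas concept (obtained via the pseudo-inverse of W)."""
    direction = np.linalg.pinv(W)[concept_index]
    return direction / np.linalg.norm(direction)


# Synthetic demo: the "unknown" model space is a hidden linear mixing
# of the atlas space, observed on the same 500 shared inputs.
rng = np.random.default_rng(0)
d_atlas, d_model, n_inputs = 8, 16, 500
labels = [f"concept_{i}" for i in range(d_atlas)]  # hypothetical atlas labels

atlas_acts = rng.random((n_inputs, d_atlas))
mixing = rng.normal(size=(d_atlas, d_model))  # hidden from the aligner
model_acts = atlas_acts @ mixing

W = fit_linear_alignment(model_acts, atlas_acts)
print(top_concepts(model_acts[0], W, labels))
print(steering_direction(W, concept_index=2).shape)
```

In this sketch the expensive part, labeling the atlas dimensions, is done once; each new model needs only a cheap regression on shared inputs, which is the amortization argument made above. A real system would likely use residual-stream activations and SAE feature activations in place of the synthetic arrays.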