LG · Jan 27

Decomposing multimodal embedding spaces with group-sparse autoencoders

arXiv:2601.20028v1
Originality: Incremental advance
AI Analysis

This work targets interpretability and control in cross-modal tasks. It is an incremental advance that builds on existing sparse autoencoder methods.

The paper addresses a known failure of sparse autoencoders on multimodal embedding spaces: they often learn unimodal features. It proposes a method that combines cross-modal random masking with group-sparse regularization. On CLIP and CLAP embeddings, the authors report a more multimodal dictionary, fewer dead neurons, and improved feature semanticity compared to standard SAEs.

The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn "split dictionaries", where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.
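To make the ingredients named in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact loss or masking scheme, so everything below is an assumption made for illustration. The group penalty is taken to be an L2,1 norm in which each feature's pair of activations (one per modality, for a paired sample) forms a group. The masking step is taken to be a random per-feature swap between the two modalities' codes. The helper names (`group_sparse_penalty`, `cross_modal_mask`, `modality_score`) are hypothetical.

```python
import numpy as np


def encode(x, W_enc, b_enc):
    """ReLU encoder of a plain SAE: maps embeddings to nonnegative codes."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def group_sparse_penalty(z_a, z_b):
    """Assumed L2,1 penalty over paired codes of shape (batch, n_features).

    For each pair and each feature, the two activations (modality a and
    modality b) form one group. The penalty sums the L2 norms of the groups
    and averages over the batch.
    """
    return np.sqrt(z_a ** 2 + z_b ** 2).sum(axis=1).mean()


def cross_modal_mask(z_a, z_b, rng, p=0.5):
    """Assumed masking scheme: for each pair and feature, take the
    activation from modality a with probability p, otherwise from b."""
    take_a = rng.random(z_a.shape) < p
    return np.where(take_a, z_a, z_b)


def modality_score(z_a, z_b):
    """Per-feature share of activation mass coming from modality a.

    A score near 0.5 means the feature fires for both modalities; a score
    near 0 or 1 means it is essentially unimodal. Features that never fire
    are flagged as dead and get a score of NaN.
    """
    mass_a = z_a.sum(axis=0)
    mass_b = z_b.sum(axis=0)
    total = mass_a + mass_b
    dead = total == 0
    score = np.where(dead, np.nan, mass_a / np.where(dead, 1.0, total))
    return score, dead


def toy_loss(x_a, x_b, W_enc, b_enc, W_dec, b_dec, rng, lam=0.1):
    """Illustrative objective: reconstruct both paired embeddings from the
    randomly mixed code, plus the group-sparse penalty weighted by lam."""
    z_a = encode(x_a, W_enc, b_enc)
    z_b = encode(x_b, W_enc, b_enc)
    recon = cross_modal_mask(z_a, z_b, rng) @ W_dec + b_dec
    mse = np.mean((recon - x_a) ** 2) + np.mean((recon - x_b) ** 2)
    return mse + lam * group_sparse_penalty(z_a, z_b)


# Two paired samples with the same total activation mass.
shared = group_sparse_penalty(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
split = group_sparse_penalty(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
print(shared)  # about 1.414: both modalities use the same feature
print(split)   # 2.0: each modality uses its own feature
```

Under these assumptions, the penalty is lower when a paired image and text activate the same feature than when each activates its own, which is the pressure away from a split dictionary. `modality_score` is one simple way to check the outcome: scores clustered at 0 and 1 would indicate mostly unimodal features.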

