TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.0745112.8
Originality Incremental advance
AI Analysis

For practitioners using CLIP models, TEVI offers a method to improve alignment and retrieval performance without retraining, with gains on both coarse- and fine-grained benchmarks.

TEVI uses sparse autoencoders and a masking module to selectively retain caption-described information from CLIP image embeddings, improving vision-language alignment. It achieves improved retrieval performance on MS COCO, Flickr, IIW, DOCCI, and RoCOCO benchmarks, with stronger gains on richer captions.

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes