CV AI LG | Apr 11, 2025

Steering CLIP's vision transformer with sparse autoencoders

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards

arXiv:2504.08729v1 | 26 citations | h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses the challenge of interpretability and control in vision models, offering incremental improvements for researchers and practitioners in computer vision and AI safety.

The authors train sparse autoencoders (SAEs) on CLIP's vision transformer to understand and control it. They find key differences between vision and language processing, including distinct sparsity patterns across layers and token types, and show that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. By suppressing targeted SAE features, they report state-of-the-art defense against typographic attacks and improved disentanglement on CelebA and Waterbirds.

While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
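
The "targeted suppression of SAE features" described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard ReLU sparse autoencoder applied to a single layer's activations, and every name in it (`sae_encode`, `sae_decode`, `suppress_features`, the weight names, the toy dimensions) is hypothetical. The paper's actual architecture, training setup, and steering metrics are not given in this listing.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    # Assumed standard ReLU SAE encoder: sparse, non-negative feature activations.
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    # Linear decoder: maps feature activations back to the model's activation space.
    return f @ W_dec + b_dec

def suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids, scale=0.0):
    """Rescale the chosen SAE features (scale=0.0 removes them) and return
    the edited activations. The SAE's reconstruction error is added back so
    that only the selected features change (an assumption of this sketch)."""
    f = sae_encode(x, W_enc, b_enc, b_dec)
    error = x - sae_decode(f, W_dec, b_dec)
    f_edit = f.copy()
    f_edit[..., feature_ids] *= scale
    return sae_decode(f_edit, W_dec, b_dec) + error

# Toy usage with random weights standing in for a trained SAE.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32                      # hypothetical sizes
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)
x = rng.normal(size=(4, d_model))           # e.g. activations for 4 tokens
x_steered = suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids=[3, 7])
```

In this sketch, the edited activations `x_steered` would replace the original activations at the chosen layer before the rest of the forward pass runs. Adding the reconstruction error back means that with `scale=1.0` the activations are returned unchanged.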
