CL AI May 21, 2025

Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

arXiv:2505.15038v2 15.511 citations h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses robustness issues in language model steering for AI safety and control applications, representing an incremental improvement over existing methods.

The paper tackles the problem of noisy features in linear concept vectors for steering language models, proposing Sparse Autoencoder-Denoised Concept Vectors (SDCV), which improve steering success rates by 4-16% across six challenging concepts.

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
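
The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The toy SAE (a ReLU encoder and a linear decoder with weights `W_enc`, `b_enc`, `W_dec`, `b_dec`), the selection rule (largest gap in mean latent activation between positive and negative samples), the `scale` factor, and all function names are assumptions made for illustration only. The sketch keeps the top-k latents, zeroes the rest, reconstructs the hidden states, and takes a difference-in-means over the reconstructions.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def sae_decode(z, W_dec, b_dec):
    """Toy SAE decoder (assumed form): z @ W_dec + b_dec."""
    return z @ W_dec + b_dec


def top_k_discriminative(z_pos, z_neg, k):
    """Indices of the k latents with the largest absolute gap in mean activation.

    This selection rule is an assumption; the abstract only says the latents
    "best differentiate positive and negative samples".
    """
    gap = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    return np.argsort(gap)[-k:]


def denoised_diff_in_means(h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k, scale=1.0):
    """Difference-in-means concept vector computed on SAE-denoised hidden states."""
    z_pos = sae_encode(h_pos, W_enc, b_enc)
    z_neg = sae_encode(h_neg, W_enc, b_enc)
    idx = top_k_discriminative(z_pos, z_neg, k)

    # Keep only the selected latents (scaled by `scale`); zero out all others.
    mask = np.zeros(z_pos.shape[1])
    mask[idx] = scale

    # Reconstruct hidden states from the masked latents.
    h_pos_dn = sae_decode(z_pos * mask, W_dec, b_dec)
    h_neg_dn = sae_decode(z_neg * mask, W_dec, b_dec)

    return h_pos_dn.mean(axis=0) - h_neg_dn.mean(axis=0), idx


# Toy usage with random weights and synthetic hidden states (all hypothetical).
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)

h_pos = rng.normal(size=(32, d_model)) + 0.5  # stand-in for concept-positive samples
h_neg = rng.normal(size=(32, d_model))        # stand-in for concept-negative samples

concept_vec, kept = denoised_diff_in_means(
    h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k=8, scale=2.0
)
print(concept_vec.shape, sorted(kept.tolist()))
```

Because the decoder in this sketch is linear, the resulting vector is a weighted sum of the decoder rows for the kept latents, so discarding the other latents removes their contribution to the steering direction entirely. A real SAE trained on the target model's activations would replace the random weights used here.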
