DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith

arXiv:2603.21461 · 93.5 · h-index: 8

Predicted impact: top 5% in LG · last 90 days · Originality: Highly original

AI Analysis

This work addresses the computational inefficiency and lack of mechanistic visibility in preference alignment for large language models, offering a more data-efficient and interpretable alternative.

The paper targets the high compute cost and limited interpretability of preference alignment. It introduces DSPA, an inference-time method that uses a conditional-difference map to steer sparse autoencoder (SAE) latents without updating base-model weights. DSPA improves MT-Bench scores and is competitive on AlpacaEval, and under restricted preference data it requires up to 4.47x fewer alignment-stage FLOPs than the two-stage RAHF-SCIT pipeline.

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
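The prompt-conditional steering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based representation of SAE activations, the averaging rule used to build the map, and the strength parameter `alpha` are all assumptions made for illustration. The sketch only mirrors the two ideas the abstract states: a map from prompt features to (chosen minus rejected) differences in response features, and a decoding-time update that touches only latents already active at the current token.

```python
from collections import defaultdict


def build_conditional_difference_map(triples):
    """Estimate a prompt-feature -> response-feature difference map.

    triples: list of (prompt_active, chosen_acts, rejected_acts), where
      prompt_active is a set of SAE feature ids active on the prompt, and
      chosen_acts / rejected_acts map feature id -> mean activation on the
      chosen / rejected response. (Hypothetical data layout.)
    Returns cd_map[i][j]: mean (chosen - rejected) activation of feature j
    over the triples whose prompt activates feature i.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for prompt_active, chosen, rejected in triples:
        for i in prompt_active:
            counts[i] += 1
            for j in set(chosen) | set(rejected):
                sums[i][j] += chosen.get(j, 0.0) - rejected.get(j, 0.0)
    return {
        i: {j: total / counts[i] for j, total in row.items()}
        for i, row in sums.items()
    }


def steering_scores(cd_map, prompt_active):
    """Average the map rows of the prompt's active features (assumed rule)."""
    seen = [i for i in prompt_active if i in cd_map]
    scores = defaultdict(float)
    for i in seen:
        for j, diff in cd_map[i].items():
            scores[j] += diff / len(seen)
    return dict(scores)


def steer_token_latents(latents, scores, alpha=1.0):
    """Shift only token-active latents (activation > 0); others are untouched.

    latents: feature id -> SAE activation at the current token.
    The result is clamped at zero so activations stay non-negative.
    """
    steered = {}
    for j, z in latents.items():
        if z > 0.0:
            steered[j] = max(0.0, z + alpha * scores.get(j, 0.0))
        else:
            steered[j] = z
    return steered


# Toy usage with made-up feature ids and activations.
triples = [
    ({1}, {10: 2.0}, {10: 0.5, 11: 1.0}),
    ({1, 2}, {10: 1.0}, {11: 1.0}),
]
cd_map = build_conditional_difference_map(triples)
scores = steering_scores(cd_map, prompt_active={1})
steered = steer_token_latents({10: 0.5, 11: 0.4, 12: 0.3, 13: 0.0}, scores)
```

In this toy run, feature 10 (preferred in chosen responses) is amplified, feature 11 (preferred in rejected responses) is suppressed to zero, and features that are inactive at the token or absent from the map are left as they were. The base-model weights are never touched; only the latent activations change.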
