LG · AI · Oct 26, 2025

Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

arXiv:2511.00029v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the safety-utility tradeoff in LLM deployment, offering developers and users a targeted method to raise refusal rates on unsafe prompts without compromising utility.

The paper tackles the problem of guiding LLMs to refuse unsafe prompts while answering safe ones, using feature-guided SAE steering with contrasting prompts. It reports an 18.9% improvement in safety performance and an 11.1% increase in utility on Llama-3 8B.

Deploying a Large Language Model (LLM) requires guiding it to recognize and refuse unsafe prompts while complying with safe ones. Previous methods achieve this by adjusting model weights, along with other expensive procedures. While recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We address this by exploring different SAE steering features and steering strengths. To efficiently choose the best features to steer, we use a contrasting prompt method built on the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench eu-dataset, and we test it on Llama-3 8B. Our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome traditional safety-utility tradeoffs when optimal features are identified through principled selection methods.
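
The abstract does not say how features are scored or how steering is applied, so the following is only a minimal sketch of one common way to do contrast-based feature selection and SAE steering, not the authors' implementation. It assumes SAE feature activations have already been collected for a set of unsafe and a set of safe prompts, scores each feature by the difference of its mean activation across the two sets, and steers by adding a scaled decoder direction to a hidden state. The function names, the scoring rule, and the toy numbers are all hypothetical.

```python
from statistics import fmean


def rank_features_by_contrast(unsafe_acts, safe_acts, top_k=1):
    """Rank SAE features by (mean activation on unsafe prompts) minus
    (mean activation on safe prompts). Assumed scoring rule, not the paper's."""
    n_features = len(unsafe_acts[0])
    scored = []
    for j in range(n_features):
        unsafe_mean = fmean([row[j] for row in unsafe_acts])
        safe_mean = fmean([row[j] for row in safe_acts])
        scored.append((unsafe_mean - safe_mean, j))
    # Largest contrast first: features that fire mostly on unsafe prompts.
    scored.sort(reverse=True)
    return [j for _, j in scored[:top_k]]


def steer(hidden, decoder_direction, strength):
    """Add strength * decoder_direction to a hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


# Toy data (hypothetical): rows are prompts, columns are SAE features.
unsafe_acts = [[0.1, 2.0, 0.0], [0.2, 1.5, 0.1]]
safe_acts = [[0.1, 0.1, 0.0], [0.3, 0.0, 0.2]]
# One decoder direction per SAE feature, in a 2-dimensional hidden space.
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

best = rank_features_by_contrast(unsafe_acts, safe_acts, top_k=1)
steered = steer([0.0, 0.0], decoder[best[0]], strength=4.0)
print(best, steered)
```

In this toy setup, feature 1 has by far the largest unsafe-minus-safe contrast, so it is the one selected for steering. In practice the steering strength would be swept over several values and each setting evaluated on both safety and utility benchmarks, which matches the abstract's description of exploring different steering features and strengths.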

