Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
This work addresses a key usability issue for deployed AI assistants by mitigating over-refusal without compromising safety, though it is an incremental improvement on existing alignment techniques.
The paper tackles the problem of over-refusal in safety-aligned language models, where benign prompts are unnecessarily rejected, and introduces ACTOR, a targeted fine-tuning method that reduces over-refusals by 15-30% on benchmarks while maintaining safety and utility.
Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility.
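The abstract does not spell out how the refusal-triggering activation components are identified or adjusted. As a rough illustration of the general idea only, the sketch below assumes a difference-of-means "refusal direction" computed from toy activation vectors, and then removes that component from the activation of a benign query that would otherwise be refused. The function names, the two-dimensional vectors, and the choice of difference-of-means are all assumptions made for illustration; this is not the ACTOR implementation.

```python
# Toy sketch, NOT the ACTOR method: all names and numbers here are hypothetical.
# Assumption: a "refusal direction" is approximated by the normalized difference
# between mean activations of harmful and benign prompts at one layer.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(harmful_acts, benign_acts):
    """Unit vector pointing from the benign mean to the harmful mean."""
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def remove_refusal_component(activation, direction):
    """Subtract the projection of `activation` onto the unit `direction`."""
    proj = dot(activation, direction)
    return [a - proj * d for a, d in zip(activation, direction)]

# Made-up layer activations (2-dimensional for readability).
harmful = [[2.0, 0.0], [4.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]

direction = refusal_direction(harmful, benign)
over_refused = [1.5, 0.7]  # hypothetical activation of a benign but refused query
target = remove_refusal_component(over_refused, direction)

print(direction)  # [1.0, 0.0]
print(target)     # [0.0, 0.7]
```

In a training setup along these lines, a vector such as `target` could serve as the regression target for the single layer being fine-tuned, with every other layer frozen. How ACTOR actually constructs its targets and loss is not stated in the abstract.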