LG · AI · Jul 6, 2025

Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

arXiv:2507.04250v1 · 16 citations · h-index: 6 · ICML
Originality: Incremental advance
AI Analysis

This work addresses a key usability issue for deployed AI assistants by mitigating over-refusal without compromising safety, though it is an incremental improvement on existing alignment techniques.

The paper tackles the problem of over-refusal in safety-aligned language models, where benign prompts are unnecessarily rejected, and introduces ACTOR, a targeted fine-tuning method that reduces over-refusals by 15-30% on benchmarks while maintaining safety and utility.

Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserve overall utility.
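The abstract describes identifying the activation component that triggers refusals and adjusting it. A minimal sketch of that idea follows, under stated assumptions: it estimates a "refusal direction" as the normalized difference between the mean activations of harmful and benign prompts (a common heuristic in the interpretability literature, not confirmed by the abstract as ACTOR's procedure), then shifts a single activation along that direction only. The function names and toy vectors are hypothetical. ACTOR itself fine-tunes the weights of one layer rather than editing activations directly, so this illustrates the notion of a targeted shift, not the training method.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, benign_acts):
    """Unit vector from the benign mean to the harmful mean (assumed heuristic)."""
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide; no direction")
    return [d / norm for d in diff]

def projection(act, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(act, direction))

def shift_along(act, direction, target):
    """Move act along direction until its projection equals target.

    Components orthogonal to direction are left unchanged, which is the
    sense in which the shift is targeted.
    """
    delta = target - projection(act, direction)
    return [a + delta * d for a, d in zip(act, direction)]

# Toy 4-dimensional activations (made-up numbers for illustration only).
harmful = [[2.0, 0.0, 1.0, 0.0], [4.0, 0.0, 1.0, 0.0]]
benign = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, benign)

# A benign-but-refused query whose activation sits near the harmful cluster.
over_refused = [2.5, 0.3, 1.0, -0.2]

# Shift it to the mean benign projection along the refusal direction.
benign_target = projection(mean_vector(benign), direction)
shifted = shift_along(over_refused, direction, benign_target)
```

In this toy setup the shift changes only the first coordinate of `over_refused`; the other coordinates, which carry everything unrelated to the estimated refusal component, are untouched.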

