Steering Over-refusals Towards Safety in Retrieval Augmented Generation
This work addresses safety alignment issues in RAG for domains such as medicine and chemistry, though the contribution is incremental because it builds on existing mitigation techniques.
The paper tackles over-refusals in retrieval-augmented generation (RAG) systems, where large language models decline benign requests due to aggressive safety filters. It introduces SafeRAG-Steering, an embedding intervention that reduces over-refusals while preserving legitimate refusals.
Safety alignment in large language models (LLMs) induces over-refusals, where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and the properties of the retrieved context influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement and contamination, the domain of the query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers embeddings at inference time toward regions confirmed to yield safe, non-refusing outputs. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
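The abstract does not spell out how the embedding intervention is computed. As a rough illustration only, the sketch below uses a generic difference-of-means steering vector, which is an assumption here and not necessarily the method used by SafeRAG-Steering. The function names, the `alpha` strength parameter, and the toy 2-D embeddings are all hypothetical, and the sketch does not model how legitimate refusals are preserved.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_direction(non_refusing, refusing):
    """Direction from the mean 'refusing' embedding to the mean
    'safe, non-refusing' embedding (difference of means; an assumed
    stand-in for the paper's intervention)."""
    safe_mean = mean_vector(non_refusing)
    refuse_mean = mean_vector(refusing)
    return [s - r for s, r in zip(safe_mean, refuse_mean)]


def steer(embedding, direction, alpha=1.0):
    """Shift an embedding along the steering direction at inference time.
    alpha is a hypothetical strength parameter."""
    return [e + alpha * d for e, d in zip(embedding, direction)]


# Toy 2-D embeddings with made-up values, for illustration only.
non_refusing = [[1.0, 0.0], [3.0, 0.0]]
refusing = [[0.0, 2.0], [0.0, 4.0]]

direction = steering_direction(non_refusing, refusing)
steered = steer([0.0, 3.0], direction, alpha=0.5)
print(direction, steered)
```

In a real pipeline the vectors would be hidden states collected from the model on benign queries with contaminated contexts, and the strength would need tuning so that harmful queries are still refused.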