LG, AI, CL | Sep 1, 2025

CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention

arXiv:2509.06982v1 | 4 citations | h-index: 17
Originality: Incremental advance
AI Analysis

This paper addresses safety alignment for LLMs deployed in real-world applications. It is an incremental improvement over existing decoding-time interventions.

The paper tackles the problem of keeping large language model outputs safe during decoding without degrading response quality. It proposes the CARE framework, which combines real-time monitoring, rollback, and introspection, and reports a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
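The three components described in the abstract (guard, rollback with a token buffer, introspection) can be sketched as a small decoding loop. This is a toy illustration under stated assumptions, not the authors' implementation: the function `guarded_decode`, its parameters, the keyword-based guard, the scripted model, the `[reflect]` critique token, and the `[refused]` fallback are all hypothetical stand-ins. Tokens are held in a buffer before being shown to the user, so an unsafe continuation can be discarded before it is ever displayed.

```python
from collections import deque
from typing import Callable, Deque, List


def guarded_decode(
    next_token: Callable[[List[str]], str],        # hypothetical model step: context -> next token
    is_unsafe: Callable[[List[str]], bool],        # hypothetical guard over the buffered tokens
    introspect: Callable[[List[str]], List[str]],  # hypothetical critique of the discarded draft
    prompt: List[str],
    buffer_size: int = 4,
    max_tokens: int = 20,
    max_rollbacks: int = 2,
    eos: str = "<eos>",
) -> List[str]:
    context = list(prompt)           # prompt plus any critiques added so far
    released: List[str] = []         # tokens already shown to the user
    buffer: Deque[str] = deque()     # tokens generated but not yet shown
    rollbacks = 0
    while len(released) + len(buffer) < max_tokens:
        tok = next_token(context + released + list(buffer))
        if tok == eos:
            break
        buffer.append(tok)
        if is_unsafe(list(buffer)):
            if rollbacks >= max_rollbacks:
                # Assumed fallback: give up after repeated unsafe drafts.
                return released + ["[refused]"]
            critique = introspect(list(buffer))
            buffer.clear()           # rollback: drop the unreleased tokens
            context += critique      # introspection: condition on the critique
            rollbacks += 1
            continue
        if len(buffer) > buffer_size:
            released.append(buffer.popleft())  # oldest token is now safe to show
    released.extend(buffer)          # flush what remains at end of generation
    return released


# Toy stand-ins so the sketch runs without a real model or guard.
def toy_next_token(ctx: List[str]) -> str:
    if "[reflect]" in ctx:
        script = ["I", "cannot", "help", "with", "that", "<eos>"]
        done = len(ctx) - 1 - ctx.index("[reflect]")  # tokens after the critique
    else:
        script = ["Here", "is", "the", "exploit", "<eos>"]
        done = len(ctx) - 1 - ctx.index("?")          # tokens after the prompt
    return script[min(done, len(script) - 1)]


def toy_guard(buffered: List[str]) -> bool:
    return "exploit" in buffered


def toy_introspect(draft: List[str]) -> List[str]:
    return ["[reflect]"]


prompt = ["How", "do", "I", "hack", "?"]
out = guarded_decode(toy_next_token, toy_guard, toy_introspect, prompt)
print(" ".join(out))  # prints "I cannot help with that"
```

In this sketch the buffer size sets the trade-off: a larger buffer gives the guard more tokens to judge before anything is shown, at the cost of added latency before the first token reaches the user. The guard here only inspects buffered tokens, which is a simplification; a real guard model would likely see the prompt and released text as well.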

