CL · AI · CR · LG · Feb 22, 2025

A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

arXiv:2502.16366v4 · 1 citation · h-index: 31
Originality: Incremental advance
AI Analysis

This paper addresses safety issues in LLMs for users and developers by providing a complementary, easier-to-evaluate method for mitigating harmful outputs. The advance is incremental because it builds on existing safety techniques.

The paper tackles the brittleness and utility degradation of existing safety post-training methods for LLMs. It introduces a special red flag token that the model learns to insert when harmful content is generated or imminent. This lets the model learn harmfulness explicitly with minimal impact on utility, and it leverages generalization capabilities such as in-context learning for self-correction.

Many safety post-training methods for large language models (LLMs) are designed to modify the model's behaviour from producing unsafe answers to issuing refusals. However, such distribution shifts are often brittle and degrade performance on desirable tasks. To address these pitfalls, we propose augmenting the model's vocabulary with a special red flag token, and training the model to insert this token whenever harmful content is generated or imminent. This approach enables the model to explicitly learn the concept of harmfulness in its representations, with minimal impact on utility due to the marginal change in the generated distribution of natural language. Moreover, because the token is embedded in the model's vocabulary, we can naturally leverage the LLMs' generalization capabilities, such as in-context learning (ICL) and out-of-distribution generalization to languages that are not formally supported (e.g., Japanese for Llama3). In particular, we demonstrate that through ICL alone, the model can learn to initiate reflective reasoning upon generating the red flag token at inference, which steers the response away from harmful continuations or enables self-correction when the flag is raised falsely. This approach is orthogonal and complementary to existing safety techniques (such as safety classifiers or standard safety training) and easier to evaluate in comparison to natural language refusals, as it does not require a human or automated judge to assess the harmlessness of the answers.
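The abstract's two mechanical ideas, inserting a red flag token into a harmful continuation and evaluating by checking for that token rather than judging a refusal, can be illustrated with a minimal sketch. This is not the authors' implementation: the token string `<rf>`, the function names, and the token-list representation are all assumptions made for illustration, and no model or tokenizer is involved.

```python
# Hypothetical red flag token string; the paper's actual token is not given here.
RED_FLAG = "<rf>"


def insert_red_flag(tokens, position):
    """Return a copy of `tokens` with RED_FLAG inserted at `position`.

    Illustrates how a training target could be built from a harmful
    continuation: the flag is placed where harmful content is imminent.
    """
    if not 0 <= position <= len(tokens):
        raise ValueError("position must lie within the token sequence")
    return tokens[:position] + [RED_FLAG] + tokens[position:]


def is_flagged(tokens):
    """A generation counts as flagged if it contains the red flag token."""
    return RED_FLAG in tokens


def flag_rate(generations):
    """Fraction of generations that contain the red flag token.

    No human or automated judge is needed: the check is a plain
    membership test on the generated tokens.
    """
    if not generations:
        return 0.0
    return sum(is_flagged(g) for g in generations) / len(generations)


if __name__ == "__main__":
    harmful = ["Sure", ",", "here", "is", "how"]
    print(insert_red_flag(harmful, 2))
    print(flag_rate([insert_red_flag(harmful, 2), ["Hello", "!"]]))
```

In a real setup the token would be added to the model's vocabulary and the flagged sequences used as fine-tuning targets; the sketch only shows the data manipulation and the judge-free evaluation check.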

