LG, CL · Oct 13, 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation

arXiv:2510.11834v1 · 1 citation · h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses safety filtering in generative AI. It appears incremental because it builds on existing fine-tuning and reinforcement learning approaches.

The paper tackles the problem of generative models producing outputs near safety classifier boundaries when fine-tuned to avoid filtering, which increases false positives and false negatives. It proposes Boundary Guidance, a reinforcement learning method that steers generation away from classifier margins, improving safety and utility on jailbreak and ambiguous prompts as measured by LLM-as-a-Judge evaluations.

Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
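The abstract does not give the reward used by Boundary Guidance, so the following is only an illustrative sketch of the contrast it describes, not the authors' formulation. It assumes a safety classifier that returns a probability `p_unsafe` in [0, 1] and filters outputs above a `threshold`; both function names, the default threshold of 0.5, and the normalization are hypothetical choices made for this example.

```python
def filter_avoidance_reward(p_unsafe: float) -> float:
    """Baseline idea from the abstract: reward a lower chance of being filtered.

    Assumption: reward is simply one minus the classifier's unsafe probability.
    """
    return 1.0 - p_unsafe


def margin_reward(p_unsafe: float, threshold: float = 0.5) -> float:
    """Hypothetical margin-style reward: distance from the decision threshold.

    The distance is normalized by the width of the side the score falls on,
    so the result is 0.0 exactly at the threshold and 1.0 at either extreme.
    This is an assumed form, not the reward defined in the paper.
    """
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must lie in [0, 1]")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    # Width of the interval on the same side of the threshold as the score.
    side_width = threshold if p_unsafe < threshold else 1.0 - threshold
    return abs(p_unsafe - threshold) / side_width


if __name__ == "__main__":
    # Scores just either side of the threshold: the baseline barely
    # distinguishes them, while the margin-style reward marks both as poor.
    for p in (0.05, 0.45, 0.55, 0.95):
        print(p, filter_avoidance_reward(p), margin_reward(p))
```

Under these assumptions, an output scored at 0.45 earns a baseline reward nearly as high as much safer outputs even though a small classifier error would flip its label, whereas the margin-style reward is low for any score close to the threshold. How the paper combines such a term with utility, and how it treats confidently unsafe outputs, is not stated in the abstract.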
