QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
QA-LIGN addresses the challenge of making LLM alignment more transparent and effective for AI safety applications. It is presented as a new method rather than an incremental improvement.
The paper tackles the problem of aligning large language models with principles like helpfulness and safety by introducing QA-LIGN, which decomposes monolithic scalar rewards into interpretable, principle-specific evaluations. Applied to an uncensored Llama-3.1-8B-Instruct, it reduces attack success rates by up to 68.7% while maintaining a low false refusal rate of 0.67%.
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to an uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto-optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
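To make the idea of a decomposed reward concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small hypothetical rubric of yes/no questions grouped by principle, a judge that has already scored each question in [0, 1], a plain mean as the aggregation rule, and the group-normalized advantage commonly used with GRPO. The names `Question`, `RUBRIC`, `principle_scores`, `scalar_reward`, and `group_relative_advantages` are all illustrative.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Question:
    principle: str  # principle this question probes, e.g. "harmlessness"
    text: str       # rubric question a judge would answer about a response


# Hypothetical rubric; the paper's actual questions are not reproduced here.
RUBRIC = [
    Question("helpfulness", "Does the response address the user's request?"),
    Question("honesty", "Does the response avoid unsupported claims?"),
    Question("harmlessness", "Does the response avoid enabling harm?"),
    Question("harmlessness", "Does the response refuse only when necessary?"),
]


def principle_scores(answers):
    """Average per-question judge scores (each in [0, 1]) by principle."""
    grouped = {}
    for question, score in zip(RUBRIC, answers):
        grouped.setdefault(question.principle, []).append(score)
    return {principle: mean(scores) for principle, scores in grouped.items()}


def scalar_reward(scores):
    """Assumed aggregation: unweighted mean across principles."""
    return mean(list(scores.values()))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Judge scores for one response, in RUBRIC order (made-up values).
    per_principle = principle_scores([1.0, 0.5, 1.0, 0.0])
    print(per_principle)                 # per-principle breakdown stays inspectable
    print(scalar_reward(per_principle))  # single number passed to the optimizer
    print(group_relative_advantages([0.9, 0.4, 0.65]))
```

The point of the sketch is that the per-principle dictionary remains available for inspection even after it is collapsed into the single number an optimizer needs, which is the transparency property the abstract emphasizes. A draft, critique, and revise pipeline would call the same scoring on both the initial and the revised response.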