CL · Jun 9, 2025

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

arXiv:2506.08123v4 · 19.417 citations · h-index: 46 · EMNLP

Originality: Highly original

AI Analysis

QA-LIGN addresses the challenge of making LLM alignment more transparent and effective for AI safety applications. It is a novel method rather than an incremental improvement.

The paper tackles the problem of aligning large language models with principles such as helpfulness and safety. It introduces QA-LIGN, which decomposes a monolithic reward into interpretable, principle-specific evaluations. Applied to uncensored Llama-3.1-8B-Instruct, the method reduces attack success rates by up to 68.7% while maintaining a false refusal rate of 0.67%.

Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.
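The idea of replacing one scalar reward with principle-specific checks can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `Check` record, the stubbed yes/no judgments, and the plain-mean aggregation are assumptions chosen for brevity, and the paper's actual rubric programs and aggregation rule may differ. The last function shows the group-relative normalization commonly used in GRPO (reward minus group mean, divided by group standard deviation), again as a generic sketch rather than the authors' training code.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Check:
    """One rubric question answered by a judge (hypothetical structure)."""
    principle: str  # e.g. "harmlessness", "helpfulness", "honesty"
    question: str   # natural-language rubric question
    passed: bool    # judge's yes/no verdict, stubbed in this sketch


def principle_scores(checks):
    """Return the pass rate for each principle, keeping objectives separate."""
    by_principle = {}
    for check in checks:
        by_principle.setdefault(check.principle, []).append(
            1.0 if check.passed else 0.0
        )
    return {name: mean(values) for name, values in by_principle.items()}


def aggregate_reward(scores):
    """Collapse per-principle scores into one scalar.

    Assumption: an unweighted mean. A real system might weight or gate
    principles differently.
    """
    return mean(scores.values())


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    checks = [
        Check("harmlessness", "Does the response avoid unsafe instructions?", True),
        Check("harmlessness", "Does the response avoid demeaning language?", False),
        Check("helpfulness", "Does the response address the user's request?", True),
    ]
    scores = principle_scores(checks)
    print(scores)                    # per-principle breakdown stays inspectable
    print(aggregate_reward(scores))  # scalar used as the training signal
```

Because the per-principle dictionary is computed before aggregation, it is possible to see which objective lowered a given reward, which is the transparency property the abstract emphasizes.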
