CLFeb 6

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang

arXiv:2602.06650v10.6h-index: 6

Originality Incremental advance

AI Analysis

It addresses the problem of inflexible safety policies in LLMs for real-world deployment, offering a more controllable and transparent approach, though it is incremental as it builds on existing safety alignment methods.

The paper tackles the safety-helpfulness trade-off in LLMs by introducing PACT, a framework for dynamic safety control through risk-aware reasoning, achieving near state-of-the-art safety performance and best controllability under user-specific policies.

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllabilityxf, making it difficult to tailor responses to diverse application needs. %As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.

View on arXiv PDF

Similar