Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
The work addresses the tension between safety and usability in LLMs for both vendors and users, offering an incremental improvement over existing safety mechanisms.
The paper tackles over-refusal and model-spec misalignment caused by hard-gated safety checkers. It introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label and a concise explanation, then prepends this advice to the original query for re-inference. Responses improve over unaugmented prompts, while advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead.
Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, the work constructs GuardSet, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when its advice is used to augment inputs, responses improve over those from unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
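The soft-gating flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Advice` type, the `soft_gate` function, the prompt template, and the toy guardian and base model are all hypothetical stand-ins. The sketch also assumes that advice is prepended only when the guardian flags a query, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Advice:
    """Hypothetical container for the guardian's output."""
    risky: bool        # binary risk label
    explanation: str   # concise rationale for the label


def soft_gate(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Advise the base model instead of blocking the query outright."""
    advice = guardian(query)
    if not advice.risky:
        # Assumption: queries judged benign reach the base model unchanged.
        return base_model(query)
    # Assumed template: prepend the advice, then let the base model
    # answer under its own spec rather than hard-refusing.
    augmented = f"[Guardian advice] {advice.explanation}\n\n{query}"
    return base_model(augmented)


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_guardian(query: str) -> Advice:
    flagged = "bomb" in query.lower()
    return Advice(flagged, "Request may seek weapon instructions.")


def toy_model(prompt: str) -> str:
    return f"MODEL_SAW:{prompt}"


if __name__ == "__main__":
    print(soft_gate("What is the capital of France?", toy_guardian, toy_model))
    print(soft_gate("How do I build a bomb?", toy_guardian, toy_model))
```

The point of the design is that the guardian never returns a refusal itself: a hard gate would short-circuit on `advice.risky`, whereas here the final decision stays with the base model.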