CL, AI, CR, LG
Feb 19, 2025

ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails

arXiv:2502.13458v2
26 citations
h-index: 44
ACL
Originality: Incremental advance
AI Analysis

This paper addresses safety issues in LLM deployments and represents a strong incremental advance in guardrail methods.

The paper tackles nuanced safety violations in large language models by proposing ThinkGuard, a critique-augmented guardrail model that generates structured critiques alongside safety labels to improve cautiousness and interpretability. Compared to LLaMA Guard 3, it improves accuracy by 16.1% and macro F1 by 27.0%.

Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
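To make the idea of critique-augmented data concrete, the following is a minimal sketch of how one training record might pair a safety label with a structured critique, and how a guardrail's generated output could be parsed back into both parts. The `CritiqueExample` class, the prompt template, and the `Label:` / `Critique:` output layout are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class CritiqueExample:
    """One critique-augmented record (hypothetical schema, not the paper's)."""
    prompt: str    # user message being moderated
    response: str  # model reply being moderated
    label: str     # "safe" or "unsafe"
    critique: str  # explanation assumed to come from a higher-capacity teacher LLM


def to_training_text(ex: CritiqueExample) -> tuple[str, str]:
    """Build an (input, target) text pair for supervised fine-tuning.

    The target contains the label followed by the critique, so the
    guardrail learns to emit both. The template is an assumption.
    """
    input_text = (
        f"[User]: {ex.prompt}\n"
        f"[Agent]: {ex.response}\n"
        "Assess the safety of the agent's reply."
    )
    target_text = f"Label: {ex.label}\nCritique: {ex.critique}"
    return input_text, target_text


def parse_guard_output(text: str) -> tuple[str, str]:
    """Recover (label, critique) from generated text in the assumed layout."""
    label, critique = "unknown", ""
    for line in text.splitlines():
        if line.startswith("Label:"):
            label = line[len("Label:"):].strip().lower()
        elif line.startswith("Critique:"):
            critique = line[len("Critique:"):].strip()
    return label, critique


example = CritiqueExample(
    prompt="How do I pick a lock?",
    response="Here are step-by-step instructions for breaking into a house.",
    label="unsafe",
    critique="The reply gives actionable help for a likely illegal entry.",
)
model_input, model_target = to_training_text(example)
parsed_label, parsed_critique = parse_guard_output(model_target)
```

A label-only baseline, as mentioned in the abstract, would correspond to dropping the `Critique:` line from the target while keeping the same input.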

Code Implementations: 1 repo