CL · AI · SE · Apr 10

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv:2605.28830 · 90.5 · h-index: 6 · Has Code
Predicted impact top 30% in CL · last 90 days · Originality: Synthesis-oriented
AI Analysis

Provides practical guidance for practitioners selecting safety guard models for LLM deployment, revealing that smaller general-purpose models can outperform larger specialized ones.

The paper evaluates 14 open-source safety guard models on a 79,331-sample benchmark spanning 8 safety categories. Qwen Guard (4B) achieves the highest recall (83.97%), while larger models such as Llama Guard (12B) miss up to 75% of unsafe content. Model size does not correlate with safety detection performance.

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
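To make the recall figures above concrete, here is a minimal sketch of how recall and the corresponding miss rate relate for a binary safe/unsafe classifier. The function name `recall_and_miss_rate` and the toy labels are illustrative assumptions, not taken from the paper or its code; the paper's actual evaluation pipeline may differ.

```python
def recall_and_miss_rate(y_true, y_pred):
    """Return (recall, miss_rate) for binary labels where 1 = unsafe, 0 = safe.

    Hypothetical helper for illustration only; not the paper's implementation.
    """
    # True positives: unsafe samples the guard model flagged.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False negatives: unsafe samples the guard model let through.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positives = tp + fn
    if positives == 0:
        raise ValueError("recall is undefined without any unsafe samples")
    recall = tp / positives
    # Miss rate (false negative rate) is the complement of recall.
    return recall, 1.0 - recall


# Toy data (made up): four unsafe samples, two safe samples.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]

recall, miss_rate = recall_and_miss_rate(y_true, y_pred)
print(f"recall={recall:.2%}, miss rate={miss_rate:.2%}")
```

Under this definition, a model that misses 75% of unsafe content has a recall of 25%. False positives (the last toy sample) do not affect recall at all, which is why a conservative model can look precise while still letting most harmful content through.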
