CL · Apr 14

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

arXiv:2604.12312 · 83.3 · h-index: 13
Predicted impact: top 58% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For enterprises deploying LLM agents, this work provides a scalable method to create training data and benchmarks for compliance violation detection, addressing a critical gap in reliable automated evaluation.

The paper introduces CompliBench, a benchmark for evaluating how well LLM judges detect compliance violations in multi-turn dialogues. Using data from an automated generation pipeline, the authors show that current proprietary LLMs struggle with the task, while a small fine-tuned model outperforms them and generalizes to unseen domains.

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
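The abstract says each synthesized dialogue carries a ground-truth label for the violated guideline and the exact conversation turn, and that judges are asked to both detect and localize violations. A minimal sketch of how such verdicts could be scored is shown below. The `Label` schema, the `score_judge` function, and the three metrics are hypothetical illustrations, not the paper's actual data format or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Label:
    """Ground truth or judge verdict for one dialogue (hypothetical schema)."""
    guideline_id: Optional[str]  # None means "no violation found"
    turn_index: Optional[int]    # turn at which the violation occurs


def score_judge(gold: Sequence[Label], pred: Sequence[Label]) -> dict:
    """Compare judge verdicts with ground-truth labels.

    The three metrics are illustrative assumptions, not the paper's metrics.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one dialogue")

    detected = guideline = localized = 0
    for g, p in zip(gold, pred):
        # Detection: the judge agrees on whether any violation exists.
        if (g.guideline_id is None) == (p.guideline_id is None):
            detected += 1
        # Guideline: the judge names the same guideline (or both say none).
        if g.guideline_id == p.guideline_id:
            guideline += 1
            # Localization: same guideline and same turn.
            if g.turn_index == p.turn_index:
                localized += 1

    n = len(gold)
    return {
        "detection_acc": detected / n,
        "guideline_acc": guideline / n,
        "localization_acc": localized / n,
    }
```

Counting localization only when the guideline also matches is one possible design choice; a looser variant could credit the correct turn independently of the guideline.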

