Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

arXiv:2605.24737 · 13.9 · Has Code

Predicted impact: top 58% in CL · last 90 days
Originality: Synthesis-oriented

AI Analysis

For AI governance researchers and practitioners, this work addresses the need for continuous compliance monitoring in production LLM systems, though it is an incremental step with limited validation.

The paper argues that current AI compliance approaches are ill-suited to the EU AI Act's requirements and proposes 'governance from metrics', a runtime framework that uses a panel of LLM evaluators to monitor compliance continuously. The authors validate it on 49 annotated prompt/response pairs, reporting agreement rates from 51.5% to 69.1% across four small models, and identify three structural failure modes and a judge-specific position bias that degrades agreement by up to 25 percentage points.

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.
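
The two mechanisms the abstract describes, inter-judge disagreement treated as a signal for human arbitration and model routing driven by accumulated compliance scores, can be illustrated with a minimal Python sketch. This is not govllm's actual API: the function names, the verdict labels, the 0.75 agreement threshold, and the use of a plain mean as the accumulated score are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PanelVerdict:
    label: str        # majority label among the judges
    agreement: float  # share of judges that voted for the majority label
    escalate: bool    # True when agreement is below the threshold


def panel_verdict(votes: dict[str, str], min_agreement: float = 0.75) -> PanelVerdict:
    """Aggregate per-judge labels; low agreement triggers human arbitration.

    `votes` maps a judge name to its label. The threshold is a hypothetical
    default, not a value taken from the paper.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    label, top = Counter(votes.values()).most_common(1)[0]
    agreement = top / len(votes)
    return PanelVerdict(label, agreement, agreement < min_agreement)


def route(history: dict[str, list[float]]) -> str:
    """Pick the model with the highest mean compliance score so far.

    A plain mean stands in for whatever accumulation rule is actually used.
    """
    return max(history, key=lambda model: sum(history[model]) / len(history[model]))


# Hypothetical votes from a four-judge panel on one prompt/response pair.
split = panel_verdict({
    "phi4-mini": "compliant",
    "mistral:7b": "non_compliant",
    "judge_c": "compliant",
    "judge_d": "non_compliant",
})
print(split.agreement, split.escalate)

# Hypothetical accumulated scores for two candidate models.
print(route({"model_a": [0.9, 0.8], "model_b": [0.6, 0.7]}))
```

In this sketch a 2-2 split yields an agreement of 0.5 and is escalated to a human reviewer instead of being resolved by majority vote, which mirrors the abstract's reading of disagreement as regulatory uncertainty and not as noise.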
