CL · AI · Oct 9, 2025

Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Jasmina Gajcin, Erik Miehling, Rahul Nair, Elizabeth Daly, Radu Marinescu, Seshu Tirupathi

arXiv:2510.08120v16.72 citations · h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses the need for interpretability in LLM-based evaluation systems, which are increasingly used at scale. The advance is incremental because it builds on existing explanation methods.

The authors tackled the problem of understanding biases and risks in LLM-as-a-Judge systems by proposing an approach to extract high-level concept-based global policies, achieving high faithfulness to LLM decisions on seven standard benchmarking datasets for content harm detection.

LLM-as-a-Judge, the use of LLMs to evaluate text, is increasingly deployed at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluate the robustness of global policies to text perturbations and adversarial attacks. Finally, we conduct a user study to evaluate user understanding and satisfaction with global policies.
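The abstract does not say how faithfulness is computed. One common reading is an agreement rate: the fraction of inputs on which the extracted policy assigns the same label as the LLM judge. The sketch below illustrates that reading only. The rule representation (`Rule`), the concept names, the labels, and the helpers `apply_policy` and `faithfulness` are all hypothetical and are not taken from the paper.

```python
from typing import Sequence

# Hypothetical representation: a rule fires when all of its required
# concepts are present in the text, and then assigns its label.
Rule = tuple[frozenset[str], str]


def apply_policy(concepts: set[str], rules: Sequence[Rule], default: str = "safe") -> str:
    """Return the label of the first rule whose concepts are all present."""
    for required, label in rules:
        if required <= concepts:  # subset test: every required concept detected
            return label
    return default


def faithfulness(policy_labels: Sequence[str], judge_labels: Sequence[str]) -> float:
    """Fraction of examples on which the policy agrees with the judge."""
    if len(policy_labels) != len(judge_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("need at least one example")
    agree = sum(p == j for p, j in zip(policy_labels, judge_labels))
    return agree / len(judge_labels)


# Toy data (invented for illustration): concepts detected per text,
# and the label an LLM judge is assumed to have given each text.
rules: list[Rule] = [
    (frozenset({"threat"}), "harmful"),
    (frozenset({"insult", "targets_group"}), "harmful"),
]
examples = [{"threat"}, {"insult"}, {"insult", "targets_group"}, set()]
judge_labels = ["harmful", "harmful", "harmful", "safe"]

policy_labels = [apply_policy(c, rules) for c in examples]
score = faithfulness(policy_labels, judge_labels)
print(policy_labels, score)
```

In this toy setup the policy disagrees with the judge on the second example, where an insult appears without a targeted group, so three of the four labels match.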

