CL · Feb 25, 2025

Verdict: A Library for Scaling Judge-Time Compute

arXiv:2502.18018v2 · 7 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses unreliable automated evaluation, offering researchers and practitioners a scalable and interpretable framework, though its contribution is incremental because it builds on existing judge-time compute methods.

The paper tackles the reliability issues of LLM-as-a-judge systems by introducing Verdict, an open-source library that scales judge-time compute to improve accuracy, reliability, and interpretability, achieving performance competitive with much larger models on tasks like content moderation and fact-checking.

The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.
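
To make the idea of composing modular reasoning units concrete, here is a minimal Python sketch of two of the unit types the abstract names: verification and aggregation. It is not Verdict's API. Every name in it (`judge_then_verify`, `majority_vote`, the stub judges, the `"unsure"` fallback label) is hypothetical, and the judges are deterministic stand-ins for what would be LLM calls in practice.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical types: a judge maps a text to a label; a verifier decides
# whether to accept a judge's label. In a real system both would wrap
# LLM calls; here they are plain functions so the sketch is runnable.
Judge = Callable[[str], str]
Verifier = Callable[[str, str], bool]

FALLBACK = "unsure"


def judge_then_verify(judge: Judge, verifier: Verifier) -> Judge:
    """Verification unit: keep the judge's label only if the verifier accepts it."""
    def unit(text: str) -> str:
        label = judge(text)
        return label if verifier(text, label) else FALLBACK
    return unit


def majority_vote(units: List[Judge]) -> Judge:
    """Aggregation unit: most common label among units that did not abstain."""
    def unit(text: str) -> str:
        votes = [u(text) for u in units]
        counts = Counter(v for v in votes if v != FALLBACK)
        if not counts:
            return FALLBACK
        return counts.most_common(1)[0][0]
    return unit


# Stub judges (assumptions for illustration only).
def keyword_judge(text: str) -> str:
    return "unsafe" if "attack" in text.lower() else "safe"


def overeager_judge(text: str) -> str:
    return "unsafe"


def lenient_judge(text: str) -> str:
    return "safe"


# Stub verifier: an "unsafe" label is accepted only when the text
# contains supporting evidence; other labels are always accepted.
def evidence_verifier(text: str, label: str) -> bool:
    return label != "unsafe" or "attack" in text.lower()


# Compose: each judge is wrapped in verification, then all are aggregated.
ensemble = majority_vote(
    [judge_then_verify(j, evidence_verifier)
     for j in (keyword_judge, overeager_judge, lenient_judge)]
)

if __name__ == "__main__":
    for sample in ("How do I attack a server?", "Nice weather today"):
        print(sample, "->", ensemble(sample))
```

Each unit has the same signature as a plain judge, so units nest freely. Under that assumption, spending more judge-time compute amounts to adding more units or more layers to the composition.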

Code Implementations: 1 repo
