CL · Apr 15

An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

arXiv:2604.13717 · 14.61 citations
Predicted impact top 74% in CL · last 90 days
Originality: Synthesis-oriented
AI Analysis

For practitioners using LLM-as-a-judge, this provides simple, cost-effective methods to significantly boost judgment accuracy.

The paper investigates practical, drop-in techniques to improve GPT-5.4 judge accuracy on RewardBench 2 without finetuning. Task-specific criteria injection (+3.0pp) and ensemble scoring (+9.8pp) together achieve 83.6% accuracy, an 11.9pp improvement over the 71.7% baseline.
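The quoted figures can be checked with a few lines of arithmetic. The sketch below assumes that both per-technique gains are measured against the same 71.7% baseline, which the summary does not state explicitly. Under that assumption the two gains are slightly sub-additive when combined.

```python
# Figures quoted in the summary, in percent and percentage points (pp).
baseline = 71.7        # baseline judge accuracy
combined = 83.6        # accuracy with criteria injection + ensembling
criteria_gain = 3.0    # reported gain from criteria injection
ensemble_gain = 9.8    # reported gain from ensemble scoring

# Assumption: each reported gain is relative to the same baseline.
combined_gain = round(combined - baseline, 1)           # reported combined gain
sum_of_parts = round(criteria_gain + ensemble_gain, 1)  # naive additive estimate
overlap = round(sum_of_parts - combined_gain, 1)        # gain lost to overlap

print(combined_gain, sum_of_parts, overlap)  # prints: 11.9 12.8 0.9
```

The combined gain matches the stated +11.9pp, and it falls about 0.9pp short of the sum of the individual gains.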

LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application-layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending), which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.
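To make the two main techniques concrete, here is a minimal sketch of criteria injection combined with ensemble scoring. It is not the paper's implementation: the `JudgeFn` interface, the prompt wording, the 1-10 scale, and the choice of the mean as the aggregation rule are all assumptions made for illustration. A real `judge` would wrap one sampled LLM call per invocation.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: takes a full prompt, returns a numeric score.
# In practice this would wrap one sampled LLM call; it is not the paper's API.
JudgeFn = Callable[[str], float]


def build_prompt(question: str, response: str, criteria: Optional[str] = None) -> str:
    """Assemble a judge prompt, optionally injecting task-specific criteria."""
    parts = ["Rate the response to the question on a 1-10 scale."]
    if criteria:
        # Criteria injection: tell the judge what matters for this task type.
        parts.append(f"Judge against these task-specific criteria: {criteria}")
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


def ensemble_score(judge: JudgeFn, prompt: str, k: int = 5) -> float:
    """Average k independent judge calls (assumed aggregation rule: the mean)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return mean(judge(prompt) for _ in range(k))


def pick_best(
    judge: JudgeFn,
    question: str,
    candidates: Sequence[str],
    criteria: Optional[str] = None,
    k: int = 5,
) -> int:
    """Return the index of the candidate with the highest ensembled score."""
    scores = [
        ensemble_score(judge, build_prompt(question, c, criteria), k)
        for c in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Each candidate costs k judge calls, so the total cost scales linearly with k. This is consistent with the abstract's pairing of larger k values with cheaper model tiers.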
