A Finite-Calibration Regime Map for LLM Judge Panels
For practitioners deploying LLM judge panels, the paper provides a practical regime map for choosing a calibration strategy under limited human labels, showing that the key question is whether the next judge's information is estimable from those labels.
The paper studies when to calibrate LLM judge panels with low-dimensional stackers versus joint output tables under finite human-label budgets, finding that scalar/reliability aggregation wins 16 of 20 real dataset–budget cells, which indicates that current judge outputs are often additive or redundant. Controlled experiments show that additive labels favor scalar methods, while a six-way interaction selects a larger joint table whose test MSE drops from 0.224 to 0.061 once unseen mass vanishes.
We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset–budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.
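The tradeoff between a low-dimensional stacker and a joint output table can be illustrated with a small synthetic sketch. This is not the paper's Finite-Calibration Panel Selection procedure or its data: it assumes six noise-free binary judges with uniformly random votes, a one-feature least-squares stacker on the mean vote, and a joint table that falls back to the global label mean for unseen vote patterns. All function names (`fit_joint_table`, `fit_scalar_stacker`, `unseen_mass`, `compare`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def pattern_ids(votes):
    """Encode each row of binary votes as one integer cell id."""
    weights = 1 << np.arange(votes.shape[1])
    return votes @ weights

def fit_joint_table(votes, labels):
    """Mean human label per observed vote pattern, plus a global fallback."""
    ids = pattern_ids(votes)
    table = {int(c): float(labels[ids == c].mean()) for c in np.unique(ids)}
    return table, float(labels.mean())

def predict_joint_table(table, fallback, votes):
    """Look up each pattern; unseen patterns get the fallback value."""
    return np.array([table.get(int(c), fallback) for c in pattern_ids(votes)])

def unseen_mass(table, votes):
    """Fraction of rows whose vote pattern never appeared in calibration."""
    return float(np.mean([int(c) not in table for c in pattern_ids(votes)]))

def fit_scalar_stacker(votes, labels):
    """Least-squares line on the mean vote (two parameters in total)."""
    x = votes.mean(axis=1)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return coef

def predict_scalar_stacker(coef, votes):
    return coef[0] + coef[1] * votes.mean(axis=1)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def additive_label(votes):
    """Assumed additive regime: the label is the mean vote."""
    return votes.mean(axis=1)

def parity_label(votes):
    """Assumed interaction regime: the label is the parity of all votes."""
    return (votes.sum(axis=1) % 2).astype(float)

def compare(n_cal, label_fn, k=6, n_test=5000, seed=0):
    """Fit both calibrators on n_cal labels and score them on fresh votes."""
    rng = np.random.default_rng(seed)
    cal = rng.integers(0, 2, size=(n_cal, k))
    test = rng.integers(0, 2, size=(n_test, k))
    y_cal, y_test = label_fn(cal), label_fn(test)
    table, fallback = fit_joint_table(cal, y_cal)
    coef = fit_scalar_stacker(cal, y_cal)
    return {
        "unseen_mass": unseen_mass(table, test),
        "table_mse": mse(predict_joint_table(table, fallback, test), y_test),
        "scalar_mse": mse(predict_scalar_stacker(coef, test), y_test),
    }

if __name__ == "__main__":
    print("additive, 20 labels:", compare(20, additive_label))
    print("parity, 2000 labels:", compare(2000, parity_label))
```

In this toy setting the stacker needs only two parameters, while the table has 64 cells to fill. With 20 calibration labels most cells are unseen, so the table pays for its fallback on additive labels. With 2000 labels the cells are populated and the table can represent the parity interaction that the mean vote cannot.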