CL · May 9

Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport

arXiv:2605.09227 · 5.41 citations
AI Analysis

For practitioners using LLM-as-a-judge, this provides a data-budget-based decision rule for choosing between two calibration methods.

The paper compares a parametric hierarchical Bayesian linear corrector and a non-parametric Neural-ODE flow for debiasing LLM-as-a-judge scores on UltraFeedback. With few anchors (100), the linear corrector matches the human-score distribution better by KL divergence and ties the flow on MAE; with many (1500), the flow wins on every metric.

[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and distributional-shape match. The headline result is that the choice between methods is primarily a data-budget question. Both correctors close the raw $+0.71$-point mean offset to within $\pm 0.08$ of the GPT-4 reference, at 100 and at 1500 anchors. Past that, the methods swap roles. With 100 anchors, the linear corrector reconstructs the human-score distribution roughly twice as well by KL divergence (0.031 vs. 0.058) and ties the flow on MAE. With 1500 anchors the flow wins on every metric (MAE 0.320 vs. 0.359, Pearson 0.922 vs. 0.896, KL 0.026 vs. 0.037). The Bayesian linear corrector saturates well below 1500 anchors: residual $\tanh$-shaped non-linearity is, by construction, structure a linear correction cannot fit. The flow keeps improving as labels grow. We translate these findings into an explicit decision rule for production deployments.
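To make the post-hoc calibration workflow concrete, here is a minimal sketch of fitting a corrector on paired anchors and scoring it on the three sub-questions named in the abstract (mean recovery, per-item accuracy, distributional match). It is not the paper's method: an ordinary least-squares line stands in for the hierarchical Bayesian corrector, the data are synthetic, and the function names, the histogram-based KL estimate, and its direction (reference against corrected scores) are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_corrector(judge, human):
    """Least-squares fit of human ~ a * judge + b on paired anchors.
    (A plain OLS stand-in, not the paper's hierarchical Bayesian model.)"""
    a, b = np.polyfit(judge, human, deg=1)
    return a, b

def histogram_kl(p_samples, q_samples, bins, eps=1e-6):
    """KL(P || Q) between smoothed histograms of two score samples."""
    p, _ = np.histogram(p_samples, bins=bins)
    q, _ = np.histogram(q_samples, bins=bins)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def evaluate(pred, human, bins):
    """Population-mean offset, per-item accuracy, and distribution match."""
    return {
        "mean_offset": float(np.mean(pred) - np.mean(human)),
        "mae": float(np.mean(np.abs(pred - human))),
        "pearson": float(np.corrcoef(pred, human)[0, 1]),
        "kl": histogram_kl(human, pred, bins),
    }

# Synthetic stand-in data (assumption): a lenient judge that compresses the scale.
rng = np.random.default_rng(0)
n_total, n_test = 1700, 200
human = rng.uniform(1.0, 10.0, size=n_total)
judge = 0.8 * human + 2.0 + rng.normal(0.0, 0.5, size=n_total)

test = slice(0, n_test)              # held-out evaluation items
bins = np.linspace(0.0, 13.0, 27)    # shared histogram grid for the KL estimate

raw = evaluate(judge[test], human[test], bins)
results = {}
for n_anchors in (100, 1500):
    anchors = slice(n_test, n_test + n_anchors)
    a, b = fit_linear_corrector(judge[anchors], human[anchors])
    results[n_anchors] = evaluate(a * judge[test] + b, human[test], bins)

print("raw:", raw)
for n_anchors, metrics in results.items():
    print(n_anchors, "anchors:", metrics)
```

Because the synthetic bias here is exactly linear, this sketch cannot reproduce the saturation effect the abstract describes; seeing that would require a non-linear (for example tanh-shaped) distortion in the simulated judge.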

