LG | AI | Dec 2, 2025

Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

arXiv:2512.03019v1 | 9.42 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the inconsistency of LLM-as-a-judge evaluations, a concern for the researchers and practitioners who rely on them. It is incremental in that it builds on existing aggregation methods.

The paper tackles noisy pairwise preference judgments from thinking LLMs by proposing a distribution-calibrated aggregation scheme for inference-time compute. Across benchmarks, the scheme reduces MAE and increases pairwise accuracy relative to standard baselines, and it matches or exceeds individual human raters when evaluated against human-consensus meta-labels.

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
