Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges
This work addresses the challenge of deploying stochastic LLM judges in low-label settings by providing a calibration method. The contribution is incremental, since it builds on existing intervention ideas.
The paper tackles the calibration of LLM judges by proposing a noise-response protocol that tests whether performance degrades predictably as noise increases. It reveals a modality gap: text judges degrade as expected, but many tabular datasets show no significant deterioration.
Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: as noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show no statistically significant performance deterioration, even under substantial signal-to-noise reduction. Interestingly, we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
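The noise-response check described above can be illustrated with a minimal sketch. The abstract does not specify the exact test statistic or noise model, so everything here is an assumption made for illustration: the function names (add_noise_at_snr, ols_slope, slope_deterioration_test) are hypothetical, the tabular perturbation is taken to be additive zero-mean Gaussian noise at a target SNR in decibels, and the slope test is implemented as a one-sided permutation test on the ordinary least-squares slope of score against noise severity. The scores in the usage section are synthetic placeholders, not results from the paper.

```python
import math
import random
from statistics import fmean


def add_noise_at_snr(values, snr_db, rng):
    """Add zero-mean Gaussian noise so that signal power / noise power
    equals the target SNR (in dB). Assumed noise model, not the paper's."""
    signal_power = fmean(v * v for v in values)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in values]


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_deterioration_test(severities, scores, n_perm=2000, seed=0):
    """One-sided permutation test of H0: no downward trend in score
    versus H1: score decreases as noise severity increases.
    Returns (observed_slope, p_value)."""
    rng = random.Random(seed)
    observed = ols_slope(severities, scores)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between severity and score
        if ols_slope(severities, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Usage with synthetic scores: 5 severity levels, 6 repeated trials each.
jitter = [-0.01, 0.0, 0.01, -0.01, 0.0, 0.01]
severities = [level for level in range(5) for _ in jitter]
degrading = [0.9 - 0.08 * level + j for level in range(5) for j in jitter]
flat = [0.6 + j for _ in range(5) for j in jitter]

for name, scores in [("degrading", degrading), ("flat", flat)]:
    slope, p = slope_deterioration_test(severities, scores)
    verdict = "deteriorates" if p < 0.05 else "no significant trend"
    print(f"{name}: slope={slope:.3f}, p={p:.4f} -> {verdict}")
```

A permutation test is used here only because it needs no distributional assumptions and no third-party dependencies; a parametric t-test on the regression slope would be an equally reasonable reading of "slope-based hypothesis test". The 0.05 threshold is likewise an illustrative choice.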