James Fiedler

LG · May 7

Bias and Uncertainty in LLM-as-a-Judge Estimation

James Fiedler

LLM-as-a-Judge (LaaJ) evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($\Delta J$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $\Delta J$ as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LaaJ evaluation.

LG · Aug 6, 2021

Simple Modifications to Improve Tabular Neural Networks

James Fiedler

There is growing interest in neural network architectures for tabular data. Many general-purpose tabular deep learning models have been introduced recently, with performance sometimes rivaling gradient-boosted decision trees (GBDTs). These recent models draw inspiration from various sources, including GBDTs, factorization machines, and neural networks from other application domains. Earlier tabular neural networks are also drawn upon but are possibly under-considered, especially models associated with specific tabular problems. This paper focuses on several such models and proposes modifications to improve their performance. When modified, these models are shown to be competitive with leading general-purpose tabular models, including GBDTs.
