CL · AI · Mar 11

Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation

Minghe Shen, Ananth Balashankar, Adam Fisch, David Madras, Miguel Rodrigues

arXiv:2604.03257 · 18.8 · h-index: 13

AI Analysis

This work addresses the challenge of certifying LLM performance, offering practitioners a practical and scalable way to reduce reliance on expensive human annotation. The contribution is incremental in that it builds on existing statistical inference methods.

The paper tackles the problem of rigorously estimating failure rates of large language models (LLMs) for safe deployment by proposing a constrained maximum-likelihood estimation method that integrates human-labeled data, LLM-judge annotations, and domain constraints, resulting in more accurate and lower-variance estimates than state-of-the-art baselines across diverse experimental conditions.

The ability to rigorously estimate the failure rates of large language models (LLMs) is a prerequisite for their safe deployment. Currently, however, practitioners often face a tradeoff between expensive human gold standards and potentially severely biased automatic annotation schemes such as "LLM-as-a-Judge" labeling. In this paper, we propose a new, practical, and efficient approach to LLM failure rate estimation based on constrained maximum-likelihood estimation (MLE). Our method integrates three distinct signal sources: (i) a small, high-quality human-labeled calibration set, (ii) a large corpus of LLM-judge annotations, and, most importantly, (iii) additional side information via domain-specific constraints derived from known bounds on judge performance statistics. We validate our approach through a comprehensive empirical study, benchmarking it against state-of-the-art baselines like Prediction-Powered Inference (PPI). Across diverse experimental regimes -- spanning varying judge accuracies, calibration set sizes, and LLM failure rates -- our constrained MLE consistently delivers more accurate and lower-variance estimates than existing methods. By moving beyond the "black-box" use of automated judges to a flexible framework, we provide a principled, interpretable, and scalable pathway towards LLM failure-rate certification.
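
The abstract does not spell out the likelihood, so the following is only a minimal sketch of how the three signal sources might be combined, not the authors' implementation. It assumes binary failure labels, a judge described by a true-positive rate and a false-positive rate, simple box constraints on both rates, and a brute-force grid search standing in for a proper optimizer. All function names, parameter names, and the synthetic counts are hypothetical.

```python
import math
from itertools import product


def log_likelihood(theta, tpr, fpr, cal_counts, judge_pos, judge_neg):
    """Joint log-likelihood of calibration pairs and judge-only labels.

    theta: failure rate P(Y=1); tpr: P(judge flags | failure);
    fpr: P(judge flags | no failure). Assumed model, not taken from the paper.
    """
    # Calibration counts: (fail & flagged, fail & unflagged, ok & flagged, ok & unflagged)
    n11, n10, n01, n00 = cal_counts
    q = theta * tpr + (1 - theta) * fpr  # marginal probability that the judge flags
    terms = [
        (n11, theta * tpr),
        (n10, theta * (1 - tpr)),
        (n01, (1 - theta) * fpr),
        (n00, (1 - theta) * (1 - fpr)),
        (judge_pos, q),
        (judge_neg, 1 - q),
    ]
    total = 0.0
    for count, prob in terms:
        if count == 0:
            continue
        if prob <= 0.0:
            return -math.inf  # observed an event the parameters rule out
        total += count * math.log(prob)
    return total


def grid(lo, hi, k):
    """k evenly spaced points from lo to hi, endpoints included."""
    return [lo + (hi - lo) * i / (k - 1) for i in range(k)]


def constrained_mle(cal_counts, judge_pos, judge_neg, tpr_bounds, fpr_bounds,
                    rate_steps=21, theta_steps=200):
    """Grid-search MLE with tpr and fpr restricted to the given bounds."""
    best_ll, best_params = -math.inf, None
    thetas = [i / theta_steps for i in range(1, theta_steps)]  # open interval (0, 1)
    rate_grid = product(grid(*tpr_bounds, rate_steps), grid(*fpr_bounds, rate_steps))
    for tpr, fpr in rate_grid:
        for theta in thetas:
            ll = log_likelihood(theta, tpr, fpr, cal_counts, judge_pos, judge_neg)
            if ll > best_ll:
                best_ll, best_params = ll, (theta, tpr, fpr)
    return best_params


# Synthetic counts matching theta=0.1, tpr=0.8, fpr=0.1 in expectation.
cal = (8, 2, 9, 81)           # 100 human-labeled calibration examples
theta_hat, tpr_hat, fpr_hat = constrained_mle(
    cal, judge_pos=1700, judge_neg=8300,
    tpr_bounds=(0.7, 0.9), fpr_bounds=(0.05, 0.15),
)
print(theta_hat, tpr_hat, fpr_hat)
```

In this sketch the constraints enter only as the search range for the judge's rates. Tightening `tpr_bounds` or `fpr_bounds` shrinks the feasible set, which is one plausible reading of how side information could reduce variance when the calibration set is small.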
