Bayesian Prediction-Powered Inference
This work is aimed at machine learning researchers and practitioners who need reliable statistical estimates when human labels are scarce, and it extends existing prediction-powered inference methods.
The paper proposes a Bayesian framework for prediction-powered inference, which improves statistical estimates drawn from limited human-labeled data. The framework yields tighter confidence intervals by combining a small human-labeled dataset with a larger set of outputs from an automatic system that may be biased. The paper also presents improved methods for cases such as discrete-response autoraters and autoraters whose scores relate non-linearly to human scores.
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. Specifically, PPI methods provide tighter confidence intervals by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate, but potentially biased, automatic system. We propose a framework for PPI based on Bayesian inference that allows researchers to develop new task-appropriate PPI methods easily. Exploiting the ease with which we can design new metrics, we propose improved PPI methods for several important cases, such as autoraters that give discrete responses (e.g., prompted LLM "judges") and autoraters with scores that have a non-linear relationship to human scores.
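
For orientation, the sketch below illustrates the general idea of combining a small human-labeled set with a large autorated set, using the classical difference-style PPI estimate of a mean. It is not the Bayesian framework proposed here. The function name `ppi_mean_interval`, its argument names, and the normal-approximation interval with `z = 1.96` are assumptions made for illustration only.

```python
import math
import statistics


def ppi_mean_interval(human_labeled, auto_on_labeled, auto_on_unlabeled, z=1.96):
    """Classical (non-Bayesian) PPI estimate of a mean, with a normal-approximation interval.

    human_labeled:     human scores on the small labeled set
    auto_on_labeled:   autorater scores on the same labeled items
    auto_on_unlabeled: autorater scores on the large unlabeled set
    """
    n = len(human_labeled)
    big_n = len(auto_on_unlabeled)

    # Rectifier terms: per-item gap between human and autorater on the labeled set.
    # Their mean estimates (and corrects for) the autorater's bias.
    diffs = [h - a for h, a in zip(human_labeled, auto_on_labeled)]

    # Point estimate: autorater mean on the large set plus the bias correction.
    estimate = statistics.fmean(auto_on_unlabeled) + statistics.fmean(diffs)

    # The two terms come from independent samples, so their variances add.
    variance = (
        statistics.variance(auto_on_unlabeled) / big_n
        + statistics.variance(diffs) / n
    )
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)
```

When the autorater tracks the human scores closely, the rectifier terms have low variance, so the interval is driven mainly by the large autorated set and is narrower than one computed from the human labels alone.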