LG | May 28

Improving Selective Classification with Pairwise Queries for Binary Classification

Harsh Vardhan, Sunav Choudhary, Natwar Modani, Arya Mazumdar

arXiv:2605.30615 | 10.7 | h-index: 8

AI Analysis

This work provides an incremental improvement for practitioners using selective classification with LLMs, by reducing error on samples the model does predict.

The paper addresses high error rates on non-rejected samples in selective classification, particularly in in-context binary classification by LLMs, where confidence estimates can be inconsistent with the model's predictions. The authors propose making additional pairwise queries to the same model to detect high-error samples. Across 1 synthetic and 4 real binary classification datasets, this yields a better accuracy-cost tradeoff than using raw confidence estimates alone.

In selective classification, a model predicts the labels of data samples where it is confident, and abstains from predicting labels for samples on which it is not confident. The rejected samples are often labeled by an expert, which is expensive. The budget for the expert is best utilized when the model has low error on non-rejected samples. However, the estimate of a model's confidence might be inconsistent with the model's predictions, which can lead to high error on non-rejected points. Such situations can readily occur in in-context binary classification by LLMs. To remedy this, we propose making additional pairwise queries to the same model. These pairwise queries can detect high-error samples and be incorporated into selective classification techniques to reduce the error on non-rejected samples. Theoretically, we establish the conditions under which a simple algorithm using pairwise queries outperforms an inconsistent confidence estimate. We support this insight through extensive experiments on 1 synthetic and 4 in-context learning-based real binary classification datasets. In all these cases, we show that our algorithms, using pairwise queries, obtain a better accuracy-cost tradeoff than using only the raw confidence estimates, for instance, the LLM's next-token logits.
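
The idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a hypothetical pairwise query `same_label(i, j)` that answers whether two samples share a label, and it scores each sample by how many pairwise answers contradict the predicted labels. The function names, the toy data, and the noiseless pairwise oracle are all assumptions made for illustration. In the paper's setting the pairwise answers would come from the same LLM and could be noisy.

```python
from typing import Callable, List, Sequence


def reject_by_confidence(confidence: Sequence[float], budget: int) -> List[int]:
    """Baseline: send the `budget` least-confident samples to the expert."""
    order = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return sorted(order[:budget])


def pairwise_disagreement(
    preds: Sequence[int], same_label: Callable[[int, int], bool]
) -> List[int]:
    """For each sample, count pairwise answers that contradict the predictions."""
    n = len(preds)
    score = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            implied_same = preds[i] == preds[j]
            if same_label(i, j) != implied_same:
                # The pair is inconsistent; either endpoint may be wrong.
                score[i] += 1
                score[j] += 1
    return score


def reject_by_pairwise(
    preds: Sequence[int], same_label: Callable[[int, int], bool], budget: int
) -> List[int]:
    """Send the `budget` samples with the most pairwise contradictions to the expert."""
    score = pairwise_disagreement(preds, same_label)
    order = sorted(range(len(preds)), key=lambda i: -score[i])
    return sorted(order[:budget])


def selective_error(
    preds: Sequence[int], truth: Sequence[int], rejected: Sequence[int]
) -> float:
    """Error rate on the samples the model keeps (the non-rejected ones)."""
    kept = [i for i in range(len(preds)) if i not in set(rejected)]
    return sum(preds[i] != truth[i] for i in kept) / len(kept)


# Toy data (made up): sample 2 is mispredicted but has the highest confidence,
# i.e. the confidence estimate is inconsistent with the prediction quality.
true_labels = [0, 0, 0, 1, 1, 1]
preds = [0, 0, 1, 1, 1, 1]
confidence = [0.9, 0.6, 0.95, 0.9, 0.7, 0.8]


def oracle_same_label(i: int, j: int) -> bool:
    """Idealized, noiseless pairwise query (an assumption of this sketch)."""
    return true_labels[i] == true_labels[j]


conf_rejected = reject_by_confidence(confidence, budget=1)
pair_rejected = reject_by_pairwise(preds, oracle_same_label, budget=1)
conf_error = selective_error(preds, true_labels, conf_rejected)
pair_error = selective_error(preds, true_labels, pair_rejected)
```

With an expert budget of one sample, the confidence baseline rejects a correctly predicted sample and keeps the confident mistake, while the pairwise score singles out the mispredicted sample. Querying all pairs costs a number of queries quadratic in the sample count, so a practical variant would likely query only a subset of pairs.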
