Pandora's Regret: A Proper Scoring Rule for Evaluating Sequential Search
For practitioners evaluating models in sequential search tasks (e.g., clinical diagnosis), this provides a decision-theoretically grounded scoring rule that aligns evaluation with actual search costs.
The paper identifies that standard proper scoring rules like log loss are misaligned with sequential search utility because they ignore ranking. It introduces Pandora's Regret, a new scoring rule that is pairwise-additive and strictly proper, and shows it better predicts clinical diagnostic costs across 597 MedMNIST models than standard metrics.
In sequential search, alternatives are tested until the true class is found. Standard proper scoring rules like log loss are local, ignoring the ranking of competitors and misaligning model evaluation with search utility. We show that sequential search induces a pairwise structure that overcomes this. By analyzing the expected cost of optimal search under varying testing costs, we derive Pandora's Regret: a closed-form, pairwise-additive, and strictly proper scoring rule. Pandora's Regret both elicits true probabilities and penalizes rank-reversing miscalibrations where distractors outrank the true class. Our construction yields a one-parameter Beta family that balances penalties for rank-swapping versus probability magnitude, while retaining a grounded interpretation as expected search cost. We prove that log loss, accuracy, and macro-F1 rely on implicit decision models misaligned with sequential search. Across 597 MedMNIST models, Pandora-based metrics better predict clinical diagnostic costs than standard alternatives, extending decision-theoretic scoring rule construction to the multiclass setting.