CL · AI · May 25

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

arXiv:2605.27091 · 83.7
Predicted impact: top 56% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners building open-ended QA systems, MiRD offers a principled way to mitigate hallucinations with coverage guarantees. It addresses a key limitation of existing conformal approaches, which discard calibration examples that have no admissible candidate.

MiRD introduces a two-stage framework for reliable set-valued prediction in open-ended QA, decomposing miscoverage into sampling failure and conditional selection failure. It controls overall miscoverage across three datasets and eight models, yielding tighter bounds than PAC alternatives and more adaptive prediction sets than successful-only calibration.
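The decomposition can be written out with the law of total probability. The notation below is ours, not the paper's: $\hat{C}(X)$ is the prediction set, $A(X)$ the set of admissible answers, and $S$ the event that the sampled candidates contain at least one admissible answer. The sketch assumes the prediction set is selected only from the sampled candidates, so a sampling failure always implies miscoverage. The paper's exact statement and constants may differ.

```latex
\begin{aligned}
\Pr\big[\hat{C}(X) \cap A(X) = \emptyset\big]
  &= \Pr[S^{c}] + \Pr[S]\,\Pr\big[\hat{C}(X) \cap A(X) = \emptyset \mid S\big] \\
  &\le \underbrace{\Pr[S^{c}]}_{\text{sampling failure}}
     + \underbrace{\Pr\big[\hat{C}(X) \cap A(X) = \emptyset \mid S\big]}_{\text{conditional selection failure}} .
\end{aligned}
```

Under these assumptions, if Stage I bounds the first term by $\alpha_s$ and Stage II bounds the second by $\alpha_c$, the overall miscoverage is at most $\alpha_s + \alpha_c$.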

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
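To make the two stages concrete, here is a minimal Python sketch of the generic building blocks: an empirical sampling-failure rate and a standard split-conformal threshold. It is not MiRD's procedure. The abstract does not specify the Stage I bound or the admission-correlated score, so the function names, the toy scores, and the use of a plain empirical rate in place of the paper's bound are all assumptions made for illustration.

```python
import math

def empirical_sampling_failure(has_admissible):
    """Fraction of calibration questions whose sampled candidates
    contain no admissible answer (a plain empirical rate, NOT the
    paper's Stage I upper bound)."""
    return sum(1 for ok in has_admissible if not ok) / len(has_admissible)

def conformal_threshold(scores, alpha):
    """Standard split-conformal quantile: the k-th smallest calibration
    score with k = ceil((n + 1) * (1 - alpha)). Returns +inf when k > n,
    i.e. when the calibration set is too small for the requested level."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def prediction_set(candidates, scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in zip(candidates, scores) if s <= threshold]

# Hypothetical calibration data: one nonconformity score per question,
# e.g. the score of its admissible sampled answer.
calibration_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
threshold = conformal_threshold(calibration_scores, alpha=0.25)

# Hypothetical test question with three sampled candidates.
selected = prediction_set(["a", "b", "c"], [0.2, 0.7, 0.6], threshold)
```

With seven calibration scores and `alpha=0.25`, the quantile index is `ceil(8 * 0.75) = 6`, so the threshold is the sixth-smallest score, `0.6`, and candidates `"a"` and `"c"` are kept.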
