Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction
For practitioners of composed image retrieval, this work addresses the fundamental false-negative problem and enables ambiguity-aware retrieval without sacrificing performance on precise queries.
Current composed image retrieval (CIR) systems assume a single target per query and so ignore ambiguity. The authors reframe CIR as calibrated intent resolution: a conformal prediction layer returns candidate sets with coverage guarantees, and an expected-information-gain policy asks clarifying questions when the set is large. The method matches the single-turn state of the art on precise queries, reaches the intended target in a fraction of the interaction budget required by naive conversational baselines, and is the first to report valid coverage and calibration for CIR.
Composed image retrieval (CIR) searches a corpus with a reference image and text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as "make it more formal" does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set which carries a coverage guarantee and whose size is a principled measure of ambiguity; when the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks, our method matches the single-turn state of the art, confirming that calibrated resolution costs nothing on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines. It is also the first to report valid coverage and calibration for the task.
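To make the pipeline described above concrete, the following is a minimal sketch of one way such a loop could look, not the authors' implementation. It assumes standard split conformal prediction with a nonconformity score of 1 minus the retriever's similarity, a uniform belief over the candidate set, and questions whose answers are fully determined by each candidate (so the expected information gain reduces to the entropy of the answer). All function names, attribute names, and numbers are hypothetical.

```python
import math
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores[i] is the nonconformity (assumed here: 1 - similarity) of the
    true target for calibration query i. Returns the ceil((n+1)(1-alpha))-th
    smallest score, or +inf when n is too small for the requested level.
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if k > n else float(scores[k - 1])


def candidate_set(sims, qhat):
    """Indices of corpus images whose nonconformity 1 - sim is at most qhat."""
    return np.flatnonzero(1.0 - np.asarray(sims, dtype=float) <= qhat)


def expected_information_gain(probs, answers):
    """Entropy (in bits) of the answer distribution over the candidate set.

    Assumes each candidate determines its answer, so the mutual information
    between the answer and the target equals the entropy of the answer.
    """
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()
    answers = np.asarray(answers)
    gain = 0.0
    for a in np.unique(answers):
        p_a = probs[answers == a].sum()
        if p_a > 0:
            gain -= p_a * math.log2(p_a)
    return gain


def best_question(probs, questions):
    """Name of the question (ambiguity axis) with the highest expected gain."""
    return max(questions, key=lambda q: expected_information_gain(probs, questions[q]))


def contract(candidates, answers, observed):
    """Keep only the candidates whose attribute matches the user's answer."""
    return [int(c) for c, a in zip(candidates, answers) if a == observed]


if __name__ == "__main__":
    # Hypothetical calibration scores and retriever similarities.
    cal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
    qhat = conformal_threshold(cal, alpha=0.2)
    sims = [0.9, 0.8, 0.5, 0.76, 0.2, 0.7]
    cand = candidate_set(sims, qhat)

    # Hypothetical per-image attributes standing in for ambiguity axes.
    attrs = {
        "color": np.array(["black", "black", "red", "navy", "red", "navy"]),
        "sleeve": np.array(["long", "long", "short", "long", "short", "long"]),
    }
    questions = {name: values[cand] for name, values in attrs.items()}
    uniform = np.ones(len(cand))

    axis = best_question(uniform, questions)
    print("candidate set:", cand.tolist())
    print("ask about:", axis)
    print("after answer 'navy':", contract(cand, questions[axis], "navy"))
```

Under the usual exchangeability assumption between calibration and test queries, the set built this way contains the true target with probability at least 1 - alpha on average over queries. A small set signals a precise query that needs no interaction, while a large set triggers the question whose answer splits the remaining candidates most evenly.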