Taming Object Hallucinations with Verified Atomic Confidence Estimation
The work addresses reliability issues in MLLMs that affect users of vision-language applications, though the contribution is incremental because it builds on existing self-verification methods.
The paper tackles object hallucinations in Multimodal Large Language Models (MLLMs) by introducing TACO, a framework that combines self-verification with confidence calibration. Across five benchmarks and two models (LLaVA-1.5-7B and CogVLM2), TACO consistently outperforms direct prompting and Visual Contrastive Decoding.
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (\texttt{LLaVA-1.5-7B} and \texttt{CogVLM2}) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
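The two confidence-aggregation modes named in the abstract can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: the function names (`self_consistency`, `self_confidence`, `verify_atomic_queries`), the yes/no answer format, the 0.5 decision threshold, and the `paraphrase` and `ask` callables are all hypothetical stand-ins for the paper's prompts and model calls, which are not specified here.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


def self_consistency(answers: Sequence[str]) -> Tuple[str, float]:
    """Black-box aggregation (assumed form): majority vote over the
    answers to paraphrased queries; confidence is the winning vote share."""
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def self_confidence(yes_probs: Sequence[float]) -> Tuple[str, float]:
    """Gray-box aggregation (assumed form): average the model's
    probability of answering "yes" across paraphrases."""
    p_yes = mean(yes_probs)
    # Hypothetical 0.5 threshold; report confidence in the chosen label.
    return ("yes", p_yes) if p_yes >= 0.5 else ("no", 1.0 - p_yes)


def verify_atomic_queries(
    queries: Sequence[str],
    paraphrase: Callable[[str], Sequence[str]],  # hypothetical paraphraser
    ask: Callable[[str], str],                   # hypothetical MLLM call
) -> Dict[str, Tuple[str, float]]:
    """Ask each atomic query and its paraphrases, then aggregate
    the answers with the black-box self-consistency vote."""
    results = {}
    for query in queries:
        variants = [query, *paraphrase(query)]
        results[query] = self_consistency([ask(v) for v in variants])
    return results
```

In this sketch, the per-query labels and confidences would then be handed to a language model for the final refinement step; how that prompt is built is left out because the abstract does not describe it.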