CV | Dec 20, 2024

VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models

arXiv:2412.15739v1 | 2 citations | h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a critical reliability issue for users deploying LVLMs in applications that require accurate visual-language understanding, though it is an incremental improvement over existing calibration techniques.

The paper tackles object hallucinations in Large Vision-Language Models (LVLMs), where models generate plausible but inaccurate information. It presents VORD, a method that uses ordinal relationships between modified image pairs to calibrate token predictions. The authors report better calibration and effective mitigation of hallucinations across multiple benchmarks.

Large Vision-Language Models (LVLMs) have made remarkable developments along with the recent surge of large language models. Despite their advancements, LVLMs have a tendency to generate plausible yet inaccurate or inconsistent information based on the provided source content. This phenomenon, also known as "hallucinations", can have serious downstream implications during the deployment of LVLMs. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: (1) a minimalist training-free variant which eliminates implausible tokens from modified image pairs, and (2) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object hallucinations on a wide range of LVLM benchmarks.
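The abstract does not spell out the exact ordinal rule, so the following is only a minimal sketch of how the two variants might look. It assumes the relationship is that a visually grounded token should be at least as probable given the original image as given the modified one. The function names `ordinal_filter` and `ordinal_penalty`, the `margin` parameter, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
def ordinal_filter(probs_orig, probs_mod, margin=0.0):
    """Training-free sketch (assumed rule): drop tokens whose probability under
    the original image falls below their probability under the modified image
    plus a margin, then renormalize the surviving tokens."""
    kept = [p if p >= q + margin else 0.0 for p, q in zip(probs_orig, probs_mod)]
    total = sum(kept)
    if total == 0.0:
        # Nothing satisfies the ordinal constraint: fall back to the
        # unmodified distribution rather than return an invalid one.
        return list(probs_orig)
    return [k / total for k in kept]


def ordinal_penalty(probs_orig, probs_mod, margin=0.0):
    """Trainable sketch (assumed form): hinge penalty that is positive only for
    tokens violating the ordinal constraint, summed over the vocabulary."""
    return sum(max(0.0, q + margin - p) for p, q in zip(probs_orig, probs_mod))


# Toy next-token distributions over a three-word vocabulary (made-up numbers).
vocab = ["dog", "cat", "frisbee"]
p_original = [0.6, 0.1, 0.3]   # conditioned on the original image
p_modified = [0.3, 0.2, 0.5]   # conditioned on the modified image

calibrated = ordinal_filter(p_original, p_modified)
penalty = ordinal_penalty(p_original, p_modified)
```

In this toy case only "dog" is more probable with the original image than with the modified one, so it is the only token that survives filtering, while "cat" and "frisbee" each contribute to the penalty. A real implementation would operate on the model's logits at every decoding step and would need the paper's actual image modification and margin definition.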
