CV, AI, CL | Jun 16, 2025

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

arXiv:2506.13130v1 | 2 citations | h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses hallucinations in MLLMs for users who need reliable multimodal outputs, but it is incremental: it builds on existing hallucination detection tasks with a more fine-grained approach.

The paper tackles hallucinations generated by multimodal large language models by proposing ZINA, a fine-grained detection and editing method that outperforms existing models such as GPT-4o and Llama-3.2 on a new dataset of 6.9k manually annotated and 20k synthetic samples.

Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
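To make the task's output concrete, here is a minimal sketch of how a span-level detection and editing result could be represented and applied to a model output. It is an illustration under stated assumptions, not ZINA's actual interface. The names `HallucinatedSpan`, `make_span`, and `apply_edits` are hypothetical, and the labels `TYPE_A` and `TYPE_B` are placeholders, since the abstract does not name the six error categories.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HallucinatedSpan:
    """One detected span (hypothetical schema, not taken from the paper)."""
    start: int                  # inclusive character offset into the output
    end: int                    # exclusive character offset into the output
    error_type: str             # placeholder label; real category names are not given here
    replacement: Optional[str]  # suggested refinement; None means "delete the span"


def make_span(text: str, phrase: str, error_type: str,
              replacement: Optional[str]) -> HallucinatedSpan:
    """Build a span from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return HallucinatedSpan(start, start + len(phrase), error_type, replacement)


def apply_edits(text: str, spans: List[HallucinatedSpan]) -> str:
    """Return `text` with every span replaced by its suggested refinement."""
    ordered = sorted(spans, key=lambda s: s.start)
    # Reject overlapping spans, because their edits would be ambiguous.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            raise ValueError("overlapping spans")
    # Apply from right to left so earlier offsets stay valid.
    for span in reversed(ordered):
        if not (0 <= span.start <= span.end <= len(text)):
            raise ValueError("span out of range")
        text = text[:span.start] + (span.replacement or "") + text[span.end:]
    return text


caption = "A red car is parked next to two dogs."
spans = [
    make_span(caption, "red", "TYPE_A", "blue"),
    make_span(caption, "two", "TYPE_B", "three"),
]
edited = apply_edits(caption, spans)
print(edited)
```

Applying edits from the rightmost span first means a replacement of a different length never shifts the offsets of spans that are still waiting to be applied.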

