MCiteBench: A Multimodal Benchmark for Generating Text with Citations
MCiteBench addresses unreliable outputs in multimodal AI systems, a concern for researchers and developers. The contribution is incremental: it extends citation generation from text-only to multimodal scenarios.
The paper tackles hallucination in Multimodal Large Language Models by introducing MCiteBench, the first benchmark to assess their ability to generate text with citations in multimodal contexts. Its experiments show that these models struggle to ground outputs reliably and exhibit a systematic modality bias.
Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, leaving the challenges of multimodal scenarios largely unexplored. In this paper, we introduce MCiteBench, the first benchmark designed to assess the ability of MLLMs to generate text with citations in multimodal contexts. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. Experimental results reveal that MLLMs struggle to ground their outputs reliably when handling multimodal input. Further analysis uncovers a systematic modality bias and reveals how models internally rely on different sources when generating citations, offering insights into model behavior and guiding future directions for multimodal citation tasks.