CV CL | Oct 15, 2025

MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

Keyan Zhou, Zecheng Tang, Lingfeng Ming, Guanghao Zhou, Qiguang Chen, Dan Qiao, Zheming Yang, Libo Qin, Minghui Qiu, Juntao Li, Min Zhang

arXiv:2510.13276v1 | 3.6 | h-index: 20

Originality: Incremental advance

AI Analysis

This work addresses the challenge of assessing long-context faithfulness in multimodal models, which is critical for real-world applications. It is an incremental advance, however, because it extends existing text-only evaluations to the multimodal domain.

The authors evaluate the fidelity of large vision-language models (LVLMs) in long-context scenarios by introducing MMLongCite, a comprehensive benchmark spanning 8 tasks and 6 context length intervals. They find that state-of-the-art LVLMs show limited faithfulness when handling long multimodal contexts.

The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
