CV · CL · Oct 15, 2025

MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

arXiv:2510.13276v1 · h-index: 20
Originality: Incremental advance
AI Analysis

The benchmark addresses the challenge of assessing long-context faithfulness in multimodal models, which is critical for real-world applications. The advance is incremental, however, because it extends existing text-only evaluations to the multimodal domain.

The authors evaluate the fidelity of large vision-language models (LVLMs) in long-context scenarios by introducing MMLongCite, a comprehensive benchmark spanning 8 tasks and 6 context length intervals. They find that state-of-the-art LVLMs show limited faithfulness when handling long multimodal contexts.

The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.

