CL AI MM · Jan 15

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang

arXiv:2601.10108v1 · 2 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the problem of assessing whether multimodal AI models genuinely comprehend scientific literature. It offers an evaluation framework that incrementally improves on existing benchmarks.

The paper tackles the challenge of evaluating how well multimodal large language models understand long scientific papers. It proposes the 'Fish-in-the-Ocean' paradigm, which requires explicit cross-modal evidence chains, and introduces SIN-Bench, whose four tasks show that grounding is the primary bottleneck. Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 reaches the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned scores.

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", which scores predictions only when they are grounded to verifiable anchors and diagnoses evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
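The "No Evidence, No Score" idea can be illustrated with a minimal sketch. This is not the paper's scoring code: the `Prediction` type, the `evidence_gated_score` function, the anchor IDs, and the precision-based weighting are all hypothetical, chosen only to show how an answer score might be gated on verifiable evidence. The matching, relevance, and logic diagnostics named in the abstract are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical model output: an answer verdict plus cited evidence anchors."""
    answer_correct: bool       # whether the answer matches the reference answer
    cited_anchors: list[str]   # IDs of text spans or figures cited as evidence


def evidence_gated_score(pred: Prediction, gold_anchors: set[str]) -> float:
    """Score a prediction only when its evidence can be verified.

    Assumption (not from the paper): a correct answer is weighted by the
    fraction of cited anchors that appear in the gold anchor set.
    """
    if not pred.cited_anchors:
        return 0.0  # no evidence cited, so no score
    matched = [a for a in pred.cited_anchors if a in gold_anchors]
    if not matched:
        return 0.0  # evidence cited, but none of it is verifiable
    precision = len(matched) / len(pred.cited_anchors)
    return float(pred.answer_correct) * precision
```

Under this sketch, a correct answer with no cited anchors scores zero, which mirrors the gap the abstract describes between answer accuracy and evidence-aligned scores.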

