CL · Apr 1

M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee

arXiv:2604.01306 · 29.2 · h-index: 1

Predicted impact: top 24% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

This paper addresses the need for better benchmarks for multimodal claim verification in scientific applications, though its contribution is incremental, focusing on dataset creation and evaluation.

The paper tackles the problem of evaluating strict consistency between scientific claims and multimodal evidence by introducing M2-Verify, a large-scale dataset with over 469K instances across 16 domains. It finds that state-of-the-art models struggle: top-model Micro-F1 drops from 85.8% on low-complexity medical perturbations to 61.6% on high-complexity challenges such as anatomical shifts.

Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
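The abstract reports results as Micro-F1. As a rough illustration of what that metric measures, the sketch below pools true positives, false positives, and false negatives over all labels before computing a single F1 score. The label names ("consistent", "inconsistent") and the toy predictions are assumptions made for illustration only; the paper's actual label set and evaluation script are not described on this page.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over every label, then compute F1 once."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1  # predicted this label and it was correct
            elif p == label:
                fp += 1  # predicted this label but gold differs
            elif g == label:
                fn += 1  # gold is this label but prediction differs

    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels and predictions, not taken from M2-Verify.
gold = ["consistent", "inconsistent", "consistent", "inconsistent", "consistent"]
pred = ["consistent", "consistent", "consistent", "inconsistent", "inconsistent"]
score = micro_f1(gold, pred)
print(f"Micro-F1: {score:.3f}")
```

When every instance carries exactly one label, as in this toy setup, Micro-F1 coincides with plain accuracy (here, 3 correct out of 5).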
