CL CV | Jun 10, 2025

ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Ruiran Su, Jiasheng Si, Zhijiang Guo, Janet B. Pierrehumbert

arXiv:2506.08700v2 | 6.71 citations | h-index: 47 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses a gap in scientific fact-checking: verifying claims against charts, which matters to researchers and analysts. The advance is incremental because it builds on existing multimodal reasoning benchmarks.

The authors tackle scientific fact-checking on charts by introducing ClimateViz, a large-scale benchmark with 49,862 claims linked to 2,896 visualizations. They find that even the best multimodal models reach only 76.2% to 77.8% accuracy in label-only settings, far below human performance (89.3% and 92.7%).

Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.
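The label-only setting described in the abstract amounts to three-way classification scored by accuracy. The following is a minimal sketch of that scoring, not the authors' released code: the `ClaimExample` class, its field names, the triple layout for the knowledge graph explanation, and the `label_accuracy` function are all assumptions made for illustration. Only the three label names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three labels named in the abstract.
LABELS = ("support", "refute", "not enough information")


@dataclass
class ClaimExample:
    """Hypothetical record layout; the released dataset may differ."""
    chart_id: str   # assumed identifier of the linked visualization
    claim: str      # natural-language claim about the chart
    label: str      # one of LABELS
    # Assumed (subject, relation, object) triples standing in for the
    # structured knowledge graph explanation.
    triples: List[Tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def label_accuracy(gold: List[ClaimExample], predicted: List[str]) -> float:
    """Fraction of claims whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(ex.label == pred for ex, pred in zip(gold, predicted))
    return correct / len(gold)
```

Under these assumptions, a model's predictions would be passed as a list of label strings in the same order as the gold examples. Explanation-augmented evaluation would need a separate comparison of the predicted and gold triples, which this sketch does not attempt.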
