LG CL | Feb 27, 2025

MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems

Xinwu Ye, Chengfan Li, Siming Chen, Wei Wei, Xiangru Tang

arXiv:2503.01891v2 | 16.97 citations | h-index: 2 | Has Code | ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for rigorous evaluation of multimodal scientific understanding in AI models, though it is incremental as it builds on existing benchmarking efforts.

The authors introduce MMSciBench, a benchmark of Chinese multimodal scientific problems for evaluating the scientific reasoning of language and vision-language models. Even the best model they evaluated achieved only 63.77% accuracy, with particular difficulty on visual reasoning tasks.

Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only 63.77% accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced at GitHub, and the dataset is available at Hugging Face.
