MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems
This work addresses the need for rigorous evaluation of multimodal scientific understanding in AI models. Its contribution is incremental, since it builds on existing benchmarking efforts.
The authors tackled the problem of evaluating the scientific reasoning capabilities of language models and vision-language models by introducing MMSciBench, a benchmark of Chinese multimodal scientific problems. They found that even the best model achieved only 63.77% accuracy and that models struggled most with visual reasoning tasks.
Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations: even the best model achieves only 63.77% accuracy and struggles particularly with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced on GitHub, and the dataset is available on Hugging Face.