CL, CV · Sep 7, 2025

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong

arXiv:2509.06079v1 · 10.95 citations · h-index: 7 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses multimodal reasoning and offers AI researchers a competition-winning solution to a domain-specific challenge, the SeePhys physics benchmark.

The paper tackles multimodal reasoning by introducing a caption-assisted reasoning framework that bridges visual and textual modalities. The approach achieved 1st place in the ICML 2025 SeePhys Challenge, and its generalization was validated on the MathVerse benchmark.

Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
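The abstract does not describe the pipeline in detail. One plausible reading of "caption-assisted reasoning" is a two-stage flow: a vision model first turns the diagram into a text description, and a text reasoning model then solves the problem from the question plus that description. The Python sketch below illustrates only that general idea and is not the authors' implementation. `Problem`, `build_reasoning_prompt`, `caption_assisted_answer`, and the prompt wording are all hypothetical, and the captioner and reasoner are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Problem:
    """A single multimodal question (hypothetical container)."""
    question: str
    image_path: str


def build_reasoning_prompt(question: str, caption: str) -> str:
    """Combine the image caption and the question into one text-only prompt.

    The wording is an illustrative assumption, not taken from the paper.
    """
    return (
        "Diagram description:\n"
        f"{caption.strip()}\n\n"
        "Question:\n"
        f"{question.strip()}\n\n"
        "Reason step by step, then state the final answer."
    )


def caption_assisted_answer(
    problem: Problem,
    captioner: Callable[[str], str],
    reasoner: Callable[[str], str],
) -> str:
    """Stage 1: caption the image. Stage 2: reason over text only."""
    caption = captioner(problem.image_path)
    prompt = build_reasoning_prompt(problem.question, caption)
    return reasoner(prompt)


if __name__ == "__main__":
    # Stand-in callables; a real system would call a vision-language model
    # for captioning and a text reasoning model for the second stage.
    def stub_captioner(image_path: str) -> str:
        return "A 2 kg block rests on a frictionless incline at 30 degrees."

    def stub_reasoner(prompt: str) -> str:
        return f"[model output for a {len(prompt)}-character prompt]"

    example = Problem("What is the block's acceleration?", "incline.png")
    print(caption_assisted_answer(example, stub_captioner, stub_reasoner))
```

Keeping the two stages behind separate callables makes it possible to swap the captioning model or the reasoning model independently, which is the main practical appeal of a caption-first design.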
