CV AI CLOct 10, 2025

CapGeo: A Caption-Assisted Approach to Geometric Reasoning

Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang

arXiv:2510.09302v18.42 citationsh-index: 9

Originality Highly original

AI Analysis

This addresses a core bottleneck in MLLMs for tasks like solving geometry problems, offering a new pathway to enhance their capabilities.

The paper tackles the challenge of geometric reasoning in Multimodal Large Language Models (MLLMs) by introducing CapGeo, a caption-assisted framework that converts geometric diagrams into textual descriptions, resulting in substantial performance improvements—for example, Qwen2.5-VL-72B increased from 8.6% to 59.0% accuracy.

Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.

View on arXiv PDF

Similar