MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
This work addresses the narrow focus of current mathematical MLLMs, a problem relevant to researchers, but its contribution is incremental: it builds on existing methods and adds new data.
The paper tackles the limited diversity of visual content handled by multi-modal large language models for mathematics by constructing a fine-tuning dataset (MathVL) and developing specialized models (MathGLM-Vision), which achieve significant improvements on several public benchmarks and on a curated test set of 2,000 problems.
Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly on text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, focus predominantly on geometric problems and ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information used by these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we construct a fine-tuning dataset named MathVL and develop a series of specialized mathematical MLLMs, termed MathGLM-Vision, by conducting Supervised Fine-Tuning (SFT) on MathVL with backbones of various parameter scales. To evaluate the effectiveness of MathGLM-Vision extensively, we conduct experiments on several public benchmarks and on our curated MathVL-test, which consists of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements over existing models, including its backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.