Simple Vision-Language Math Reasoning via Rendered Text
This paper addresses the challenge of math reasoning for vision-language models, though the contribution is incremental: it builds on existing methods and adds a simple augmentation.
The paper tackles the problem of enabling vision-language models to solve math problems by rendering LaTeX equations into images and pairing them with structured chain-of-thought prompts. It reports state-of-the-art math reasoning accuracy, along with gains of up to 20% on general-domain benchmarks such as MMMU, ChartQA, and DocVQA.
We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
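As a rough illustration of the render-and-pair idea described above, the following Python sketch renders a LaTeX string to a PNG and attaches a structured chain-of-thought prompt. It is not the authors' implementation. The names `MathSample`, `render_latex`, `build_prompt`, and `make_sample`, the prompt template, the `<image>` placeholder, and the use of matplotlib's mathtext (which supports only a subset of LaTeX) are all assumptions made for illustration. The `dpi` parameter stands in loosely for rendering fidelity.

```python
from dataclasses import dataclass


@dataclass
class MathSample:
    """One hypothetical training record: rendered equation plus prompt."""
    image_path: str  # path to the rendered equation image
    prompt: str      # structured chain-of-thought prompt
    answer: str      # reference answer used as the training target


def render_latex(latex: str, out_path: str, dpi: int = 200) -> str:
    """Render a LaTeX math string to a PNG using matplotlib's mathtext.

    Assumption: mathtext is enough for the equation at hand; arbitrary
    LaTeX may need a full TeX toolchain instead.
    """
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    fig = plt.figure(figsize=(4, 1))
    fig.text(0.5, 0.5, f"${latex}$", ha="center", va="center", fontsize=20)
    fig.savefig(out_path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)
    return out_path


def build_prompt(question: str) -> str:
    """Build a structured chain-of-thought prompt (hypothetical template)."""
    return (
        "<image>\n"  # assumed placeholder for the rendered equation
        f"Question: {question}\n"
        "Reasoning: think step by step, stating each algebraic move.\n"
        "Final answer:"
    )


def make_sample(latex, question, answer, out_path, render=render_latex):
    """Render the equation and pair it with the structured prompt."""
    image_path = render(latex, out_path)
    return MathSample(image_path=image_path, prompt=build_prompt(question), answer=answer)
```

With matplotlib installed, `make_sample("x^2 = 4", "Solve for x.", "x = 2 or x = -2", "eq.png")` writes `eq.png` and returns the paired record. The `render` argument is injectable so a different renderer can be swapped in when comparing rendering fidelity.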