CL · May 25, 2025

MMATH: A Multilingual Benchmark for Mathematical Reasoning

arXiv:2505.19126v1 · 12 citations · h-index: 25 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a gap in evaluating and improving multilingual reasoning for large language models, though it is incremental as it builds on existing benchmarks and methods.

The authors tackle the underexplored problem of multilingual complex reasoning in large language models by introducing MMATH, a benchmark of 374 math problems across 10 languages. Using it, they show that models such as DeepSeek R1 exhibit performance disparities across languages and often respond in an unintended language, and they demonstrate strategies that improve both performance and target-language consistency.

The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in the target language can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
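The two quantities the abstract discusses, per-language performance and the off-target issue, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record keys (`target_lang`, `response_lang`, `correct`) and the function name are hypothetical, and the sketch assumes that language identification and answer checking have already been done by some other tool.

```python
from collections import defaultdict


def summarize_by_language(records):
    """Aggregate per-language accuracy and off-target rate.

    Each record is a dict with hypothetical keys (not taken from MMATH):
      target_lang   - language the problem was posed in, e.g. "es"
      response_lang - language detected in the model's response
      correct       - True if the final answer matched the reference
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "off_target": 0})
    for r in records:
        c = counts[r["target_lang"]]
        c["n"] += 1
        c["correct"] += int(r["correct"])
        # A response is off-target when its language differs from the target.
        c["off_target"] += int(r["response_lang"] != r["target_lang"])
    return {
        lang: {
            "accuracy": c["correct"] / c["n"],
            "off_target_rate": c["off_target"] / c["n"],
        }
        for lang, c in counts.items()
    }


# Toy records for illustration only; these are not results from the paper.
records = [
    {"target_lang": "es", "response_lang": "es", "correct": True},
    {"target_lang": "es", "response_lang": "en", "correct": True},
    {"target_lang": "ja", "response_lang": "ja", "correct": False},
    {"target_lang": "ja", "response_lang": "en", "correct": True},
]
summary = summarize_by_language(records)
```

Reporting the two numbers side by side makes the trade-off visible: a model can score well on accuracy for a language while still answering in English much of the time, which is the situation the off-target rate is meant to expose.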

Code Implementations: 1 repo