CL | Feb 24, 2025

Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne

arXiv:2502.17407v2 | 27.437 citations | h-index: 8 | Has Code | ACL

Originality Synthesis-oriented

AI Analysis

This work addresses multilingual performance in AI reasoning for researchers and practitioners, showing that test-time scaling yields only incremental gains outside English and may not transfer well to other languages.

The paper tackles the problem of whether test-time scaling methods generalize effectively across languages in mathematical reasoning, finding that while methods like Budget Forcing yield large improvements in English (e.g., 20-point gain on AIME), they provide only small average gains (e.g., 1.94 points) across other languages, indicating limited linguistic generalizability.

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages (a pattern consistent across the other test-time scaling methods we studied), highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
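
For readers unfamiliar with best-of-N selection using an outcome reward model, the idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `best_of_n`, `generate`, and `score` are hypothetical names, standing in for a sampling call to a language model and a reward model that assigns one scalar score to each complete solution.

```python
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],      # hypothetical: samples one candidate solution
    score: Callable[[str, str], float],  # hypothetical: outcome reward for (problem, solution)
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the generator.
    candidates: List[str] = [generate(problem) for _ in range(n)]
    # Score each complete solution once (outcome-level, not step-level).
    scored = [(score(problem, c), c) for c in candidates]
    # Keep the candidate with the highest reward; ties resolve to the earliest sample.
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score
```

In this sketch, inference cost grows roughly linearly with `n`, which is why comparisons against longer single-chain reasoning are typically made at matched inference budgets.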
