CL, May 28, 2025

Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

arXiv:2505.21941v1
2 citations
h-index: 38
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving multilingual text generation for users of language models. It appears incremental, since it extends an existing inference-time technique to a new domain.

The paper applies test-time scaling with repeated sampling to improve multilingual text generation quality. It reports consistent quality improvements, with gains exceeding 35% in some cases, on the Aya Evaluation Suite and m-ArenaHard benchmarks.

Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifiers for the task.
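
The selection step described in the abstract can be illustrated with a minimal best-of-N sketch in Python. This is not the paper's implementation: the candidate texts, the per-token log-probabilities, the reward values, and the helper names (`perplexity`, `best_of_n`, `perplexity_verifier`, `reward_verifier`) are all hypothetical, and a real setup would sample the candidates from a language model and score them with an actual reward model.

```python
import math
from typing import Callable, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical samples for one prompt, each with made-up token log-probs.
# In practice these would come from N sampled generations of a model.
samples = {
    "Respuesta A": [-0.2, -0.4, -0.3],
    "Respuesta B": [-1.5, -2.0, -1.0],
    "Respuesta C": [-0.6, -0.5, -0.9],
}

# Hypothetical reward-model outputs (assumption: higher means better).
toy_rewards = {"Respuesta A": 0.1, "Respuesta B": 0.3, "Respuesta C": 0.9}


def perplexity_verifier(text: str) -> float:
    # Lower perplexity is better, so negate it to get a score to maximize.
    return -perplexity(samples[text])


def reward_verifier(text: str) -> float:
    return toy_rewards[text]


best_by_perplexity = best_of_n(list(samples), perplexity_verifier)
best_by_reward = best_of_n(list(samples), reward_verifier)
print(best_by_perplexity, "|", best_by_reward)
```

In this toy data the two verifiers pick different candidates, which mirrors the abstract's point that the choice of verifier matters for the task at hand.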

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.