CL | LG | May 28

Speculative Decoding Across Languages

arXiv:2605.30580 | 27.5 | h-index: 1
AI Analysis

This work is relevant to researchers and practitioners deploying LLMs in multilingual settings. It improves the efficiency of speculative decoding for non-English languages, which makes it an incremental improvement to existing LLM inference methods.

The paper addresses the reduced effectiveness of speculative decoding for non-English languages, which stems from the poor multilingual capabilities of small draft models. It compares three strategies (task-specific finetuning, monolingual finetuning, and n-gram draft models) and finds that n-gram models consistently offer large speed-ups despite lower acceptance rates, because they generate drafts much faster.
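The trade-off between acceptance rate and draft cost can be made concrete with the commonly used expected speed-up estimate for speculative decoding. The following is a minimal sketch, not the paper's own analysis. It assumes that each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft step costs `c` target-model forward passes. The function name and all numeric values are hypothetical and are not results from the paper.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated speed-up of speculative decoding over plain decoding.

    alpha: per-token acceptance probability (assumed i.i.d., a simplification)
    gamma: number of tokens drafted per verification step
    c:     cost of one draft step, in units of one target forward pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1 or c < 0.0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")

    # Expected tokens produced per verification step (accepted drafts plus
    # the one token the target model always contributes).
    if alpha == 1.0:
        tokens_per_step = gamma + 1.0
    else:
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one step: gamma draft calls plus one target forward pass.
    cost_per_step = gamma * c + 1.0
    return tokens_per_step / cost_per_step


# Illustrative, made-up settings: a neural draft model with a high acceptance
# rate but non-trivial cost, and an n-gram drafter with a lower acceptance
# rate but almost no cost.
neural = expected_speedup(alpha=0.8, gamma=4, c=0.2)
ngram = expected_speedup(alpha=0.5, gamma=4, c=0.001)
print(f"neural draft: {neural:.2f}x, n-gram draft: {ngram:.2f}x")
```

With these illustrative values the n-gram drafter comes out slightly ahead (about 1.93x against about 1.87x) even though its acceptance rate is much lower, because its drafting cost is close to zero.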

Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective. We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
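To illustrate how an n-gram draft model fits into the draft-and-verify loop, here is a toy sketch using a bigram table and greedy verification. It is an assumption-laden illustration, not the authors' implementation. The names `train_bigram`, `draft`, `speculative_step`, `generate`, and `toy_target` are hypothetical, and the stand-in target model simply replays a fixed sentence. A real system would score all drafted positions in a single parallel forward pass of the target model. Here the target is called once per position for readability.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Token = str


def train_bigram(corpus: Sequence[Sequence[Token]]) -> Dict[Token, Token]:
    """Map each token to its most frequent successor in a monolingual corpus."""
    counts: Dict[Token, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


def draft(table: Dict[Token, Token], context: Sequence[Token], k: int) -> List[Token]:
    """Propose up to k tokens by repeated table lookup (very cheap)."""
    proposal: List[Token] = []
    last = context[-1]
    for _ in range(k):
        if last not in table:
            break
        last = table[last]
        proposal.append(last)
    return proposal


def speculative_step(
    target_next: Callable[[List[Token]], Token],
    table: Dict[Token, Token],
    context: Sequence[Token],
    k: int,
) -> Tuple[List[Token], int]:
    """One draft-and-verify step. Returns (new tokens, number of accepted drafts)."""
    new_tokens: List[Token] = []
    for tok in draft(table, context, k):
        expected = target_next(list(context) + new_tokens)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
    # All drafts accepted (or none proposed): the target adds one more token.
    new_tokens.append(target_next(list(context) + new_tokens))
    return new_tokens, len(new_tokens) - 1


def generate(target_next, table, prompt: Sequence[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative generation; output matches the target's greedy output."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        new_tokens, _ = speculative_step(target_next, table, out, k)
        out.extend(new_tokens)
    return out[: len(prompt) + max_new]


# Toy stand-in for the large target model: it deterministically replays a sentence.
REFERENCE = ["el", "perro", "come", "carne"]


def toy_target(context: List[Token]) -> Token:
    return REFERENCE[len(context)] if len(context) < len(REFERENCE) else "</s>"


table = train_bigram([["el", "gato", "come", "pescado"], ["el", "perro", "come", "carne"]])
result = generate(toy_target, table, prompt=["el"], max_new=3)
print(result)
```

Because verification is greedy, a rejected draft never changes the output. It only determines how many tokens each target-model step yields, which is why a cheap drafter can pay off even when many of its proposals are rejected.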

