CL · AI · Feb 25, 2025

Scaling LLM Pre-training with Vocabulary Curriculum

arXiv:2502.17910v1 · 3 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses pre-training efficiency for large language models. The advance is incremental, since it builds on existing curriculum learning and tokenization methods.

The paper tackles the inefficiency of static vocabularies in LLM pre-training by introducing vocabulary curriculum learning, which alternates between entropy-guided vocabulary expansion and model optimization, resulting in log-linear scaling gains relative to vocabulary size.

Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
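The alternation described in the abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' released code: empirical bigram counts stand in for the language model, "model optimization" is reduced to refitting those counts each round, and the names `next_token_entropy`, `expand_vocabulary`, `retokenize`, `vocabulary_curriculum`, `threshold`, and `max_new` are hypothetical. Each round merges the most frequent adjacent token pairs whose continuation is predictable (low next-token entropy), so predictable spans end up as longer tokens while uncertain contexts stay finely tokenized.

```python
import math
from collections import Counter, defaultdict


def next_token_entropy(tokens):
    """Entropy in bits of the empirical next-token distribution after each token."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1
    entropy = {}
    for prev, counts in follow.items():
        total = sum(counts.values())
        entropy[prev] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy


def expand_vocabulary(tokens, vocab, threshold, max_new):
    """Add merged tokens for frequent pairs whose first token has low next-token entropy."""
    entropy = next_token_entropy(tokens)
    candidates = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if entropy[a] <= threshold
    )
    merged = {a + b for (a, b), _ in candidates.most_common(max_new)}
    return vocab | merged


def retokenize(text, vocab):
    """Greedy longest-match tokenization of text with the current vocabulary."""
    max_len = max((len(t) for t in vocab), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                out.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no token in vocab matches text at position {i}")
    return out


def vocabulary_curriculum(text, rounds=3, threshold=0.5, max_new=1):
    """Alternate retokenization (stand-in for training) with entropy-guided expansion."""
    vocab = set(text)  # assumption: start from a character-level vocabulary
    history = []       # (vocabulary size, token count) recorded at the start of each round
    for _ in range(rounds):
        tokens = retokenize(text, vocab)
        history.append((len(vocab), len(tokens)))
        vocab = expand_vocabulary(tokens, vocab, threshold, max_new)
    return vocab, history
```

On the toy string "abababab" with the defaults above, the token count per round falls from 8 to 4 to 2 as "ab" and then "abab" enter the vocabulary. A real setup would replace the bigram counts with per-token entropies from the model being trained.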

