CL, LG · Aug 21, 2025

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

arXiv:2508.15390v2 · 10.95 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work offers a principled explanation of why larger vocabularies help language model pre-training, giving NLP researchers and practitioners a clearer basis for tokenizer-model co-design. The contribution is incremental: it mainly clarifies existing practice.

The study investigated the impact of vocabulary size on language model pre-training, finding that larger vocabularies reduce the complexity of tokenized text and lower cross-entropy loss primarily on the 2,500 most frequent words, which cover about 75% of tokens in downstream tasks.

Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text -- formalized via Kolmogorov complexity -- and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.
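The word-level loss decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the choice to define a word's loss as the sum of its tokens' negative log-likelihoods, and the ranking of words by frequency within the evaluated text itself are all assumptions made for illustration. The cutoff `top_k=2500` mirrors the figure quoted in the abstract.

```python
from collections import Counter


def word_losses_from_tokens(token_losses, word_ids):
    """Sum per-token losses into per-word losses.

    Assumption: word_ids[i] is the 0-based index of the word that
    token i belongs to, so a word split into several tokens
    accumulates the loss of all of them.
    """
    n_words = max(word_ids) + 1 if word_ids else 0
    per_word = [0.0] * n_words
    for loss, wid in zip(token_losses, word_ids):
        per_word[wid] += loss
    return per_word


def decompose_loss_by_frequency(words, word_losses, top_k=2500):
    """Split loss into a frequent 'head' and a rare 'tail'.

    Assumption: frequency is counted on `words` itself; a real study
    would more likely rank words on a separate reference corpus.
    """
    if len(words) != len(word_losses):
        raise ValueError("words and word_losses must have equal length")

    counts = Counter(words)
    head_vocab = {w for w, _ in counts.most_common(top_k)}

    head_loss = tail_loss = 0.0
    head_n = tail_n = 0
    for word, loss in zip(words, word_losses):
        if word in head_vocab:
            head_loss += loss
            head_n += 1
        else:
            tail_loss += loss
            tail_n += 1

    total_n = head_n + tail_n
    total_loss = head_loss + tail_loss
    return {
        # Fraction of word occurrences covered by the frequent head.
        "head_coverage": head_n / total_n if total_n else 0.0,
        "head_mean_loss": head_loss / head_n if head_n else 0.0,
        "tail_mean_loss": tail_loss / tail_n if tail_n else 0.0,
        # Share of the total loss contributed by the frequent head.
        "head_loss_share": head_loss / total_loss if total_loss else 0.0,
    }
```

Running this decomposition on models that differ only in vocabulary size, and comparing `head_mean_loss` against `tail_mean_loss`, is one way to check the pattern the abstract reports: loss falling on frequent words while rising on the rare tail.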
