TREX: Tokenizer Regression for Optimal Data Mixture
This work addresses the efficiency and cost challenges that LLM developers face in multilingual tokenizer design, though its contribution is incremental because it builds on existing tokenizer optimization methods.
The paper tackles the problem of determining optimal language-specific data mixtures for multilingual LLM tokenizers. It introduces TREX, a regression-based framework that predicts optimal mixtures and improves compression efficiency by up to 12% over mixtures based on LLaMA3 and uniform distributions.
Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
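The three stages described in the abstract (sample random mixtures, measure proxy compression, regress and search) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the regression model, so plain least squares over hand-picked features is used here, and `proxy_score` is a hypothetical synthetic stand-in for actually training a small proxy tokenizer and measuring its compression. All function names, the `EPS` constant, and the per-language cost vector are invented for the example.

```python
import numpy as np

EPS = 0.05  # arbitrary smoothing constant so log(0) never occurs (assumption)


def sample_mixtures(n, k, rng):
    """Draw n random language mixtures over k languages; each row sums to 1."""
    return rng.dirichlet(np.ones(k), size=n)


def proxy_score(mixtures):
    """HYPOTHETICAL stand-in for training a proxy tokenizer on each mixture
    and measuring compression. Synthetic formula; lower is better."""
    cost = np.array([1.0, 2.0, 3.0, 4.0])  # made-up per-language difficulty
    return ((mixtures + EPS) ** -0.3) @ cost


def features(mixtures):
    """Bias, linear, quadratic, and log terms for each language ratio."""
    ones = np.ones((mixtures.shape[0], 1))
    return np.hstack([ones, mixtures, mixtures ** 2, np.log(mixtures + EPS)])


def fit_regressor(mixtures, scores):
    """Least-squares fit from mixture features to observed scores."""
    coef, *_ = np.linalg.lstsq(features(mixtures), scores, rcond=None)
    return coef


def predict(coef, mixtures):
    """Predicted score for each row of mixtures."""
    return features(mixtures) @ coef


def search_best_mixture(coef, k, rng, n_candidates=20000):
    """Random search: score many candidate mixtures (plus the uniform one)
    with the cheap regressor and return the one with the lowest prediction."""
    uniform = np.full((1, k), 1.0 / k)
    candidates = np.vstack([uniform, sample_mixtures(n_candidates, k, rng)])
    return candidates[np.argmin(predict(coef, candidates))]


rng = np.random.default_rng(0)
train = sample_mixtures(200, 4, rng)             # random proxy mixtures
coef = fit_regressor(train, proxy_score(train))  # learn mixture -> score
best = search_best_mixture(coef, 4, rng)         # cheap search, no retraining
print(best)
```

The point of the design is that the expensive step (training tokenizers) happens only for the small set of proxy mixtures, while the search over candidate mixtures queries the fitted regressor alone. In a real setting the synthetic `proxy_score` would be replaced by measured compression statistics, and the random search could be swapped for any constrained optimizer over the simplex.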