The Token Tax: Systematic Bias in Multilingual Tokenization
This paper identifies a structural bias in NLP that disproportionately affects low-resource languages, though the analysis is incremental: it extends known tokenization issues to new data.
The study finds that tokenization inefficiency systematically disadvantages morphologically complex, low-resource languages. Higher token fertility (tokens/word) reliably predicts lower accuracy across 10 LLMs on the AfriMMLU benchmark (9,000 items, 16 African languages), and a doubling in tokens quadruples training cost and time.
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute requirements and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing the accuracy gaps observed in prior generations. Finally, translating token inflation into economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
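To make the two quantities in the abstract concrete, the following minimal sketch computes fertility as tokens per whitespace-delimited word and converts a token inflation ratio into a relative cost. It rests on assumptions not stated in the abstract: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, words are split on whitespace, and cost is modeled as a power law in token count with exponent 2, chosen only because it reproduces the stated doubling-to-quadrupling relationship. The function names are illustrative, not the authors' code.

```python
def fertility(text, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def relative_cost(token_ratio, exponent=2.0):
    """Cost multiplier when the token count grows by token_ratio.

    Assumption: cost follows a power law in token count. exponent=2.0
    matches the abstract's claim that doubling tokens quadruples cost.
    """
    return token_ratio ** exponent


def toy_tokenize(text, piece_len=3):
    """Hypothetical tokenizer: cut each word into fixed-length chunks."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


if __name__ == "__main__":
    short_words = "the cat sat"      # every word fits in one chunk
    long_words = "tokenization is"   # the long word splits into four chunks
    print(fertility(short_words, toy_tokenize))
    print(fertility(long_words, toy_tokenize))
    print(relative_cost(2.0))
```

In practice, `toy_tokenize` would be replaced by the tokenizer of the model under evaluation, and the whitespace word split may need adjusting for languages or scripts where it is a poor proxy for word boundaries.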