CL · Feb 18, 2025

How does a Language-Specific Tokenizer affect LLMs?

arXiv:2502.12560v2 · 8 citations · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for empirical analysis of tokenizer design for non-English languages in NLP, though its contribution is incremental: it focuses on a single language and builds on existing methods.

This study examines how language-specific tokenizers affect Large Language Models trained predominantly on English data, using Korean as a case study. It finds that an extended Korean tokenizer lowers the model's confidence in incorrect predictions and reduces cross-entropy on complex tasks, leading to more stable generation.

Language-specific tokenizers intuitively seem crucial for effective natural language processing, yet empirical analyses of their significance and the underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models trained predominantly on English text data, using Korean as a case study. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments comparing models with the basic tokenizer and the extended tokenizer across various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce fewer nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
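The two quantities the abstract reports, cross-entropy and confidence in incorrect predictions, can be illustrated with a minimal sketch. The exact definitions used in the paper are not given here, so this assumes that "confidence" means the highest softmax probability at a step, and that a prediction is "incorrect" when the highest-probability token differs from the reference token. The function names `softmax` and `ntp_metrics` and the toy logits are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_metrics(step_logits, targets):
    """Return (mean cross-entropy, mean confidence on incorrect predictions).

    step_logits: one list of vocabulary logits per prediction step.
    targets: the reference token id for each step.
    Assumption: confidence is the top softmax probability at a step.
    """
    ce_sum = 0.0
    wrong_conf = []
    for logits, target in zip(step_logits, targets):
        probs = softmax(logits)
        # Cross-entropy contribution: negative log-probability of the reference token.
        ce_sum += -math.log(probs[target])
        # Greedy prediction: index of the most probable token.
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != target:
            wrong_conf.append(probs[pred])
    mean_ce = ce_sum / len(targets)
    # If no step was predicted incorrectly, report 0.0 by convention.
    mean_wrong_conf = sum(wrong_conf) / len(wrong_conf) if wrong_conf else 0.0
    return mean_ce, mean_wrong_conf

# Toy example with a 3-token vocabulary; only the last step is predicted incorrectly.
step_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [3.0, 0.1, 0.2]]
targets = [0, 1, 2]
mean_ce, mean_wrong_conf = ntp_metrics(step_logits, targets)
print(mean_ce, mean_wrong_conf)
```

Under these assumptions, a model that is wrong but unsure yields a lower second value than one that is wrong and highly confident, which is the kind of difference the study attributes to the extended tokenizer.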
