CL LG · Apr 28, 2024

Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya

arXiv:2404.18071v23.42 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work provides insights for developing language models in low-resource languages, though it is incremental: it extends known tokenization effects to a specific domain.

The study investigated how different tokenization strategies affect sequential language models for Nepali, finding that SentencePiece tokenization consistently outperforms byte-level BPE on understanding-based tasks despite the latter's prevalence in models like GPT and LLaMA.

The impact of subword tokenization on language model performance is well-documented for perplexity, with finer granularity consistently reducing this intrinsic metric. However, research on how different tokenization schemes affect a model's understanding capabilities remains limited, particularly for non-Latin script languages. Addressing this gap, we conducted a comprehensive evaluation of six distinct tokenization strategies by pretraining transformer-based language models for Nepali and evaluating their performance across multiple downstream tasks. While recent prominent models like GPT, RoBERTa, Claude, LLaMA, Mistral, Falcon, and MPT have adopted byte-level BPE tokenization, our findings demonstrate that for Nepali, SentencePiece tokenization consistently yields superior results on understanding-based tasks. Unlike previous studies that primarily focused on BERT-based architectures, our research specifically examines sequential transformer models, providing valuable insights for language model development in low-resource languages and highlighting the importance of tokenization strategy beyond perplexity reduction.
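As a rough illustration of why finer token granularity tends to lower per-token perplexity, the sketch below computes perplexity as the exponential of the average negative log-likelihood per token. It is not the authors' evaluation code. The sample string, the two crude "tokenizers" (whitespace words and raw UTF-8 bytes), and the total log-likelihood value are all assumptions chosen for illustration; the sketch assumes, hypothetically, that both models assign the same total likelihood to the text.

```python
import math


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity: exp of the average negative log-likelihood."""
    return math.exp(total_nll_nats / num_tokens)


# Sample Nepali text ("Nepali language"); chosen only for illustration.
text = "नेपाली भाषा"

# Two crude granularities standing in for real tokenizers (assumption):
word_units = text.split()                 # coarse: whitespace-separated words
byte_units = list(text.encode("utf-8"))   # fine: raw UTF-8 bytes

# Hypothetical total negative log-likelihood (in nats) of the whole string,
# assumed identical under both granularities.
total_nll = 30.0

for name, units in [("word", word_units), ("byte", byte_units)]:
    ppl = perplexity(total_nll, len(units))
    print(f"{name}: {len(units)} units, perplexity {ppl:.2f}")
```

Because the same total likelihood is spread over more units, the byte-level figure comes out lower even though nothing about the model's understanding has changed. This is one reason perplexity across different tokenizations is not directly comparable.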
