CL · IR · LG · Jan 8

SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

arXiv:2601.04469v1 · 2 citations · h-index: 2 · Has Code
AI Analysis

This paper offers practical guidance for NLP practitioners working with low-resource, agglutinative languages, though its contribution to tokenizer evaluation methods is incremental.

The authors address the problem of evaluating subword tokenizers for morphologically rich Uralic languages by developing SampoNLP, a toolkit that builds high-purity morphological lexicons without corpora. They use these lexicons to evaluate BPE tokenizers for Finnish, Hungarian, and Estonian across vocabulary sizes from 8k to 256k, recommend optimal vocabulary sizes for each language, and quantify the limitations of standard BPE.

The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
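The abstract mentions locating "elbow points" on IPS curves but does not say how they are detected, and the IPS formula itself is not given here. As a rough illustration only, the sketch below uses a common heuristic that is an assumption on my part, not necessarily the authors' method: normalize both axes, draw a chord from the first to the last point, and pick the point farthest from it. The function name `elbow_index` and the score values are hypothetical and made up for the example; they are not results from the paper.

```python
import math


def elbow_index(xs, ys):
    """Return the index of the point farthest from the chord joining the
    first and last points, after scaling both axes to [0, 1].

    Generic max-distance-to-chord heuristic (an assumption, not taken
    from the paper).
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs")
    x_span = max(xs) - min(xs)
    y_span = max(ys) - min(ys)
    if x_span == 0 or y_span == 0:
        raise ValueError("curve is flat along one axis; no elbow defined")

    # Scale both axes so neither dominates the distance computation.
    nx = [(x - min(xs)) / x_span for x in xs]
    ny = [(y - min(ys)) / y_span for y in ys]

    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    chord_len = math.hypot(dx, dy)

    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(nx, ny)):
        # Perpendicular distance from (x, y) to the chord.
        d = abs(dy * (x - nx[0]) - dx * (y - ny[0])) / chord_len
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Vocabulary sizes spanning the 8k-256k range mentioned in the abstract.
vocab_sizes = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
# Made-up scores with diminishing returns; NOT values from the paper.
toy_scores = [0.40, 0.55, 0.65, 0.69, 0.71, 0.72]

# Doubling steps are evenly spaced on a log scale.
log_sizes = [math.log2(v) for v in vocab_sizes]
elbow = elbow_index(log_sizes, toy_scores)
elbow_vocab = vocab_sizes[elbow]
```

On this toy curve the heuristic selects a vocabulary size of 32,000. Real IPS values would have to come from the SampoNLP resources, and a different elbow criterion could give a different answer.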

