Mar 4, 2023

Self-tuning hyper-parameters for unsupervised cross-lingual tokenization

arXiv:2303.02427v2 | 1 citation | h-index: 3
AI Analysis

This work addresses the challenge of unsupervised tokenization for low-resource and dead languages, though its contribution is incremental: it builds on an existing model and adds new fitness functions.

The paper tackles language-independent unsupervised tokenization by meta-learning hyper-parameters for English, Russian, and Chinese. It finds that human-independent fitness functions correlate well with the conventional F1 score: the additive combination of the three metrics for English and Russian, and the compression factor for Chinese.

We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In the case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.
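The evaluation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the formulas for the three metrics, so the sketch treats them as already-computed numbers. All function names here are hypothetical. The sketch assumes that the additive combination is an arithmetic mean and the multiplicative one a plain product. It also assumes that tokenization F1 is computed over character spans, and that both tokenizations cover the same text with no characters dropped.

```python
from math import prod, sqrt


def token_spans(tokens):
    """Return the set of (start, end) character spans, one per token.

    Assumes the tokens concatenate back to the original text.
    """
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(predicted, reference):
    """Span-level F1 between two tokenizations of the same text."""
    pred, ref = token_spans(predicted), token_spans(reference)
    hits = len(pred & ref)  # tokens with identical start and end offsets
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def additive(metrics):
    """Additive composite fitness (assumed here to be the arithmetic mean)."""
    return sum(metrics) / len(metrics)


def multiplicative(metrics):
    """Multiplicative composite fitness (assumed here to be the product)."""
    return prod(metrics)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy usage: score one candidate tokenization against a reference.
score = tokenization_f1(["the", " ", "c", "at"], ["the", " ", "cat"])
```

Under these assumptions, one would compute a composite fitness and the reference-based F1 for each hyper-parameter setting, then pass the two lists to `pearson` to see how well the human-independent fitness tracks F1. The setting with the highest fitness would be the one picked without any labelled data.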

