Evaluating Transferability of BERT Models on Uralic Languages
This work addresses the lack of evaluation of transformer models on low-resource Uralic languages and provides tools for linguistic research and NLP applications in these communities. The contribution is incremental, since it applies existing methods to new data.
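The study evaluates monolingual, multilingual, and randomly initialized BERT models on Uralic languages. Monolingual models perform best on their native languages but transfer worse than multilingual models or models of unrelated languages that share the same character set. Straightforward transfer of high-resource models yields what appear to be state-of-the-art POS and NER results for minority languages that have sufficient data for finetuning.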
Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. Where monolingual models are available (currently only for Estonian, Finnish, and Hungarian), they perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special efforts toward hyperparameter optimization, yields what appear to be state-of-the-art POS and NER tools for the minority Uralic languages where there is sufficient data for finetuning.
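Finetuning a BERT-style model for POS tagging or NER typically requires mapping word-level tags onto the model's subword tokens. The sketch below illustrates one common convention (label the first subword, mask the rest) and is not taken from this work: `toy_subword_split`, `align_labels`, the tag set, and the Estonian example words are all hypothetical, and the fixed-width splitter merely stands in for a real WordPiece tokenizer.

```python
from typing import Callable, Dict, List, Tuple

# Label value that PyTorch's cross-entropy loss skips by default (ignore_index=-100).
IGNORE_INDEX = -100


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-width chunks."""
    if not word:
        raise ValueError("words must be non-empty")
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with '##', as WordPiece vocabularies do.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str],
    tags: List[str],
    tag2id: Dict[str, int],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Give each word's tag to its first subword; mask the remaining subwords."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        pieces = split(word)
        tokens.extend(pieces)
        label_ids.append(tag2id[tag])
        label_ids.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


# Assumed toy tag inventory and sentence, for illustration only.
tag2id = {"INTJ": 0, "NOUN": 1}
tokens, label_ids = align_labels(["Tere", "maailm"], ["INTJ", "NOUN"], tag2id)
print(tokens)     # ['Tere', 'maai', '##lm']
print(label_ids)  # [0, 1, -100]
```

With this convention the loss is computed once per word, so words that a transferred vocabulary splits into many pieces, as may happen for a minority language, do not receive extra weight during finetuning.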