CL · Dec 20, 2021

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

arXiv:2112.10553v1 · 16 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of developing effective NLP models for low-resource languages such as Estonian, Latvian, and Lithuanian, though the contribution is incremental, since it builds on established BERT methods.

The authors tackled the problem of training BERT models for low-resource Baltic languages, showing that their newly created trilingual LitLat BERT and monolingual Est-RoBERTa models improved results on tasks like named entity recognition and dependency parsing compared to existing models.

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on the results of existing models on all tested tasks in most situations.
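
Of the four downstream tasks, word analogy is the least self-explanatory, so a minimal sketch may help. The abstract does not say how the task was scored; the code below assumes the common vector-offset formulation ("a is to b as c is to d", solved by finding the word nearest to b - a + c under cosine similarity). The function names and the toy three-dimensional vectors are hypothetical and stand in for embeddings extracted from a model such as LitLat BERT or Est-RoBERTa.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Assumption: vector-offset method, target = b - a + c, with the three
    query words excluded from the candidate set.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Toy vectors invented for illustration; real embeddings would come from a model.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 0.2],
}

prediction = solve_analogy(toy, "man", "woman", "king")
print(prediction)
```

Accuracy on an analogy benchmark would then be the fraction of quadruples for which the predicted word matches the expected one; the benchmark data and the way embeddings are pooled from the model are outside what the abstract specifies.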

