CL · Oct 24, 2024

LLMs for Extremely Low-Resource Finno-Ugric Languages

arXiv:2410.18902v2 · 14 citations · h-index: 22 · NAACL
Originality: Synthesis-oriented
AI Analysis

The paper addresses linguistic inequality for speakers of extremely low-resource Finno-Ugric languages. It is an incremental advance that applies existing methods to new data.

This paper tackles the underrepresentation of low-resource Finno-Ugric languages like Võro, Livonian, and Komi in large language models by developing multilingual base and instruction-tuned models, creating evaluation benchmarks including smugri-MT-bench, and conducting human evaluation to promote linguistic diversity in NLP.

The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.

Code Implementations: 1 repo
