CLLG | Dec 18, 2024

Language verY Rare for All

arXiv:2412.13924v1 | 19 citations | h-index: 4 | COLING Workshops
Originality: Incremental advance
AI Analysis

This work addresses machine translation for rare languages, which gives speakers of those languages access to translation tools. The advance is incremental because it builds on existing techniques such as fine-tuning and RAG.

The paper tackles machine translation for rare languages such as Monégasque, which lack translation tools because little data is available. It introduces LYRA, a method that combines open LLM fine-tuning, RAG, and transfer learning, and shows that LYRA frequently surpasses and consistently matches state-of-the-art encoder-decoder models.

In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, and some models (e.g., NLLB 1.3B) can even be trained on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. The study uses single-GPU training exclusively, to ease adoption, and focuses on two-way translation between French and Monégasque, a rare language that existing translation tools do not support because too little corpus data is available. Our results demonstrate LYRA's effectiveness: it frequently surpasses and consistently matches state-of-the-art encoder-decoder models in rare-language translation.
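The abstract names retrieval-augmented generation as one ingredient but does not describe how retrieval is done. As a rough illustration only, the following Python sketch shows one common way RAG can be applied to translation: pick the parallel-corpus entries whose French side is most similar to the input, then place them in the prompt as examples. The word-overlap (Jaccard) scoring, the function names, the prompt layout, and the placeholder target strings are all assumptions made for this sketch, not details taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query, corpus, k=2):
    """Return the k (source, target) pairs whose source is closest to the query.

    Similarity here is plain word overlap; this is an assumption of the
    sketch, and a real system might use embeddings instead.
    """
    q = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda pair: jaccard(q, tokenize(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot translation prompt (hypothetical layout)."""
    blocks = ["Translate from French to Monégasque."]
    for src, tgt in examples:
        blocks.append(f"French: {src}\nMonégasque: {tgt}")
    blocks.append(f"French: {query}\nMonégasque:")
    return "\n\n".join(blocks)


# Toy parallel corpus. The target sides are placeholders, not real
# Monégasque text.
corpus = [
    ("Bonjour, comment allez-vous ?", "<target sentence 1>"),
    ("Le port est près du marché.", "<target sentence 2>"),
    ("Comment allez-vous aujourd'hui ?", "<target sentence 3>"),
]

query = "Comment allez-vous ?"
examples = retrieve(query, corpus, k=2)
prompt = build_prompt(query, examples)
print(prompt)
```

The resulting prompt would then be passed to the fine-tuned model. Retrieving at inference time lets a small parallel corpus inform each translation without retraining.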

