CL · LG · Dec 18, 2024

Language verY Rare for All

Ibrahim Merad, Amos Wolf, Ziad Mazzawi, Yannick Léo

arXiv:2412.13924v111.519 citations · h-index: 4 · COLING Workshops

Originality: Incremental advance

AI Analysis

This work addresses machine translation for rare languages, which broadens access to translation tools for their speakers. The advance is incremental, however, because it builds on existing techniques such as fine-tuning and RAG.

The paper tackles machine translation for rare languages such as Monégasque, which lack translation tools because so little data is available. It introduces LYRA, a method that combines open LLM fine-tuning, RAG, and transfer learning, and reports that LYRA frequently surpasses and consistently matches state-of-the-art encoder-decoder models.

In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. To ease adoption, this study is restricted to single-GPU training. It targets two-way translation between French and Monégasque, a rare language that existing translation tools do not support because of limited corpus availability. Our results demonstrate LYRA's effectiveness, frequently surpassing and consistently matching state-of-the-art encoder-decoder models in rare language translation.
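The retrieval-augmented part of such a pipeline can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not say how LYRA retrieves examples, so word-overlap (Jaccard) similarity is used here as a stand-in, and the names `retrieve`, `build_prompt`, and `PAIRS` are hypothetical. The target strings are placeholders, not real Monégasque text. The sketch picks the parallel sentences closest to the input and places them as few-shot examples in the prompt that a fine-tuned LLM would receive.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, pairs, k=2):
    """Return the k (source, target) pairs whose source best matches the query."""
    q = tokenize(query)
    ranked = sorted(pairs, key=lambda p: jaccard(q, tokenize(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, pairs, k=2):
    """Put the retrieved pairs ahead of the query as few-shot examples."""
    blocks = ["Translate from French to Monégasque."]
    for src, tgt in retrieve(query, pairs, k):
        blocks.append(f"French: {src}\nMonégasque: {tgt}")
    blocks.append(f"French: {query}\nMonégasque:")
    return "\n\n".join(blocks)


# Assumption: a tiny stand-in parallel corpus. Targets are placeholder
# markers, not real Monégasque sentences.
PAIRS = [
    ("Le chat dort sur le lit.", "<target 1>"),
    ("Il pleut aujourd'hui.", "<target 2>"),
    ("Le chien dort dans le jardin.", "<target 3>"),
]

prompt = build_prompt("Le chat dort dans le jardin.", PAIRS)
print(prompt)
```

In practice a dense or BM25 retriever over the full parallel corpus would likely replace the overlap score; the prompt layout above is likewise only one plausible choice.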

