CL | AI | Apr 30, 2025

Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese

arXiv:2505.00114v1 | 4 citations | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of dialect translation for low-resource languages, emphasizing cultural authenticity over data volume. Its contribution is incremental, however, since it applies existing fine-tuning methods to a specific case.

The paper tackles translation of the low-resource Lebanese dialect by fine-tuning LLMs. Models trained on a smaller, culturally authentic dataset outperformed those trained on larger, non-native data, and contrastive fine-tuning achieved the best results.

This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning approaches: basic, contrastive, and grammar-hint tuning, using open-source Aya23 models. Experiments reveal that models fine-tuned on a smaller but culturally aware Lebanese dataset (LW) consistently outperform those trained on larger, non-native data. The best results were achieved through contrastive fine-tuning paired with contrastive prompting, which indicates the benefits of exposing translation models to bad examples. In addition, to ensure authentic evaluation, we introduce LebEval, a new benchmark derived from native Lebanese content, and compare it to the existing FLoRes benchmark. Our findings challenge the "More Data is Better" paradigm and emphasize the crucial role of cultural authenticity in dialectal translation. We made our datasets and code available on GitHub.
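
To make the idea of "exposing translation models to bad examples" concrete, here is a minimal sketch of how a contrastive prompt might be assembled. It is not the authors' format. The class `ContrastiveExample`, the function `build_contrastive_prompt`, the prompt wording, and the English-to-Lebanese direction are all assumptions made for illustration, and the example strings are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass
class ContrastiveExample:
    """One demonstration pairing a preferred and a dispreferred translation (hypothetical structure)."""
    source: str            # source sentence (assumed here to be English)
    good_translation: str  # native-sounding rendering the model should imitate
    bad_translation: str   # rendering the model should avoid, e.g. overly literal


def build_contrastive_prompt(example: ContrastiveExample, new_source: str) -> str:
    """Format one good/bad demonstration, then the sentence to translate.

    The template text is an assumption, not the prompt used in the paper.
    """
    return (
        "Translate from English to Lebanese Arabic.\n"
        f"English: {example.source}\n"
        f"Bad translation: {example.bad_translation}\n"
        f"Good translation: {example.good_translation}\n"
        f"English: {new_source}\n"
        "Good translation:"
    )


# Placeholder strings only; no real dataset content is reproduced here.
demo = ContrastiveExample(
    source="<demonstration sentence>",
    good_translation="<native Lebanese rendering>",
    bad_translation="<literal or non-dialectal rendering>",
)
prompt = build_contrastive_prompt(demo, "<sentence to translate>")
print(prompt)
```

The same template could plausibly serve both as a fine-tuning input and as an inference-time prompt, which is one way to read "contrastive fine-tuning paired with contrastive prompting"; the paper's repository is the authoritative source for the actual format.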

Code Implementations: 1 repo
