CL · AI · Feb 24, 2025

UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings

Layba Fiaz, Munief Hassan Tahir, Sana Shams, Sarmad Hussain

arXiv:2502.16961v1 · 2.72 citations · h-index: 2 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the poor performance of LLMs on low-resource languages like Urdu. The contribution is incremental: it adapts an existing architecture through continual pre-training and targeted fine-tuning rather than proposing a new one.

The paper tackles the suboptimal performance of multilingual LLMs on low-resource languages like Urdu by introducing UrduLLaMA 1.0. The authors report significant improvements over SOTA models on three machine translation datasets, which they present as a new benchmark for Urdu LLMs.

Multilingual Large Language Models (LLMs) often provide suboptimal performance on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.
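To illustrate why LoRA suits the limited-compute setting the abstract describes, here is a minimal sketch of the low-rank update in plain Python. It is not the authors' training code: the abstract gives no hyperparameters, so the rank `r`, the scaling factor `alpha`, and the 4096 x 4096 layer size are illustrative assumptions, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    alpha: scaling factor (assumed value, not from the paper)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [wv + scale * dv for wv, dv in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Count LoRA parameters for one layer: A has r * d_in, B has d_out * r."""
    return r * (d_in + d_out)


# Tiny example: B starts at zero, so the adapted layer initially equals the base layer.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # d_out = 2, d_in = 3
a = [[0.1, 0.2, 0.3]]                    # r = 1
b_zero = [[0.0], [0.0]]
assert effective_weight(w, a, b_zero, alpha=2.0) == w

# Assumed sizes for illustration only: a 4096 x 4096 projection with rank 16.
full = 4096 * 4096
lora = trainable_params(4096, 4096, 16)
fraction = lora / full                   # share of the layer's weights that are trained
```

Under these assumed sizes, the adapter trains under 1% of the parameters of the layer it modifies, while the base weights stay frozen.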
