CL · LG · Dec 13, 2024

Efficient Continual Pre-training of LLMs for Low-resource Languages

arXiv:2412.10244v1 · 12 citations · h-index: 12 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of improving model performance for low-resource languages at reduced computational and data cost. The advance is incremental because it builds on existing CPT methods.

The paper tackles the problem of high cost in continual pre-training (CPT) of large language models for low-resource languages by developing algorithms for selecting text subsets and tokens, achieving effective performance with very little CPT data on nine Indian languages using IndicGenBench.

Open-source large language models (Os-LLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extents of resource availability. For evaluation, we use IndicGenBench, a generation task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes and offer insights across language families.
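The abstract does not spell out how tokens are chosen for the augmented vocabulary. As a rough illustration of the general idea only, the sketch below uses a plain frequency heuristic: count whitespace-delimited words in a corpus that are missing from a base vocabulary and keep the most frequent ones. This is an assumption made for illustration and is not the paper's algorithm; the function name `select_new_tokens`, the toy corpus, and the toy base vocabulary are all hypothetical.

```python
from collections import Counter


def select_new_tokens(corpus, base_vocab, k):
    """Return up to k of the most frequent words that are absent from base_vocab.

    Assumption: a simple frequency heuristic over whitespace-delimited words,
    used here as a stand-in; it is NOT the selection algorithm from the paper.
    """
    counts = Counter(
        word
        for text in corpus
        for word in text.split()
        if word not in base_vocab
    )
    # Rank by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy data: three short texts and a one-word base vocabulary.
corpus = ["नमस्ते दुनिया", "नमस्ते भारत", "hello दुनिया नमस्ते"]
base_vocab = {"hello"}
new_tokens = select_new_tokens(corpus, base_vocab, k=2)
print(new_tokens)
```

In a real setting the candidates would more plausibly be subword units from a tokenizer trained on the target-language corpus, and the selected tokens would then be added to the model's tokenizer with a matching resize of the embedding matrix. Those steps are outside this sketch.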
