CL | Jan 7, 2025

BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

arXiv:2501.03855v1 | 19 citations | COLING Workshops
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of limited data for low-resource languages like isiXhosa, though it is incremental as it adapts existing methods to a new context.

The paper tackles language modelling for low-resource languages by pretraining two data-efficient BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. Both outperform a vanilla pretrained model on POS tagging and NER, with gains of +3.2 F1 on NER, and in some instances they outperform XLM-R.

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (fewer than 100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to far fewer than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of high-quality pretraining data and its scarcity. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

