XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs
XDoGE addresses language inclusivity for users of underrepresented languages, although the contribution is incremental because it builds on existing reweighting methods.
The paper tackles the problem of LLMs underperforming in mid- and low-resource languages because of training-data bias. It proposes XDoGE, a multilingual data reweighting method, and reports improved performance in languages such as Galician and Basque, including a 5% increase in accuracy on IberoBench tasks.
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model with the DoGE domain-reweighting algorithm, which we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the resulting language weights, either from scratch or within a continual pre-training (CPT) phase. We target six languages that span a variety of geographic, intra-family, and inter-family relations: English and Spanish (high-resource), Portuguese and Catalan (mid-resource), and Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. Using the IberoBench framework for quantitative evaluation, we investigate the effects of substantial data repetition for minor languages and of under-sampling for dominant languages. Finally, we release IberianLLM-7B-Instruct, a promising new model centered on Iberian languages and English, which we pre-trained from scratch and further improved using CPT with the XDoGE weights.
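To make step (ii) concrete, the following minimal sketch shows how a set of language weights could be turned into per-language token budgets, and how those budgets imply repetition for small corpora and under-sampling for large ones. It is an illustration under stated assumptions, not the authors' pipeline: the function name `rescale_plan`, the corpus sizes, the weights, and the total budget are all hypothetical and do not come from the paper.

```python
from typing import Dict


def rescale_plan(available: Dict[str, float],
                 weights: Dict[str, float],
                 total_budget: float) -> Dict[str, Dict[str, float]]:
    """Turn per-language weights into token budgets and repetition factors.

    available    -- tokens available per language (any consistent unit)
    weights      -- non-negative language weights (normalized here)
    total_budget -- total number of training tokens, same unit as `available`
    """
    norm = sum(weights[lang] for lang in available)
    plan = {}
    for lang, avail in available.items():
        budget = total_budget * weights[lang] / norm
        plan[lang] = {
            "budget": budget,
            # epochs > 1: the corpus is repeated; epochs < 1: it is under-sampled
            "epochs": budget / avail,
        }
    return plan


# Hypothetical corpus sizes in billions of tokens (NOT taken from the paper).
AVAILABLE = {"en": 1000.0, "es": 500.0, "pt": 100.0,
             "ca": 50.0, "gl": 5.0, "eu": 5.0}
# Hypothetical language weights (NOT the XDoGE weights).
WEIGHTS = {"en": 0.30, "es": 0.20, "pt": 0.15,
           "ca": 0.15, "gl": 0.10, "eu": 0.10}

PLAN = rescale_plan(AVAILABLE, WEIGHTS, total_budget=200.0)
for lang, row in PLAN.items():
    print(f"{lang}: budget={row['budget']:.1f}B tokens, epochs={row['epochs']:.2f}")
```

With these made-up numbers, the low-resource corpora would be seen several times while the English corpus would be only partially sampled, which is the repetition versus under-sampling trade-off the abstract sets out to study.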