CL · Mar 20, 2024

Pretraining Language Models Using Translationese

arXiv:2403.13638v3 · 25 citations · h-index: 26 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for low-resource languages and offers a practical way to narrow the pre-training gap with English. It is incremental, however, because it builds on existing translation and filtering methods.

The paper tackles pre-training language models for low-resource languages by using translationese as synthetic data. It finds that pre-training on filtered synthetic data causes only small relative performance drops (0.87% for NLU and 2.35% for NLG) compared with pre-training on clean data, and that this gap narrows when a small amount of clean data is added.

In this paper, we explore the utility of translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of web-crawled monolingual documents (clean) into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them for 5 downstream natural language understanding (NLU) and 4 generative (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG, compared to pre-training on clean data, and this gap further diminishes upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models like Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.
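The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract says only that tiny LMs trained on small clean data are used to filter the translated documents. The sketch below assumes the filter is a perplexity threshold, and it substitutes an add-one-smoothed unigram model for the tiny LM so that it runs with the standard library alone. The function names, the toy corpora, and the threshold `max_ppl=10.0` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram_lm(clean_docs):
    """Fit an add-one-smoothed unigram model on a small clean corpus.

    This is only a stand-in for the tiny LM mentioned in the abstract.
    """
    counts = Counter(tok for doc in clean_docs for tok in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab))

    return log_prob


def perplexity(log_prob, doc):
    """Per-token perplexity of a whitespace-tokenized document."""
    tokens = doc.split()
    if not tokens:
        return float("inf")  # treat empty documents as unusable
    avg_nll = -sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


def filter_synthetic(log_prob, synthetic_docs, max_ppl):
    """Keep synthetic documents whose perplexity is at most max_ppl (assumed criterion)."""
    return [d for d in synthetic_docs if perplexity(log_prob, d) <= max_ppl]


# Toy data: a tiny "clean" corpus and two "translated" documents.
clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zzz qqq xxx yyy"]

lm = train_unigram_lm(clean)
kept = filter_synthetic(lm, synthetic, max_ppl=10.0)  # hypothetical threshold
print(kept)
```

In this toy setup, the document made up entirely of unseen tokens scores a much higher perplexity than the fluent one and is dropped. A real pipeline would swap the unigram model for a trained neural LM and tune the threshold on held-out clean data.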

