CL · Jan 13

Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

arXiv:2601.08629v1 · 0.61 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the high cost and scarcity of training data for low-resource machine translation, offering a practical solution for researchers and practitioners in NLP, though it is incremental as it builds on existing data curation methods.

The paper tackles the problem of building parallel corpora for low-resource machine translation. Its framework, LALITA, selects source sentences by lexical and linguistic features so that data can be curated efficiently; the authors report improved translation quality and data needs reduced by more than half across multiple languages.

Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. For low-resource languages, however, human translation to generate sufficient data is prohibitively expensive. It is therefore crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, LALITA (Lexical And Linguistically Informed Text Analysis), targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availability with curated datasets of 50K to 800K English sentences and report improved performance at all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT system training cost by lowering the training data requirement, but also showcases LALITA's utility in data augmentation.
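The abstract does not list LALITA's actual features or scoring rule, so the following is only a rough sketch of the general idea of source-side selection that favors complex sentences. Everything in it is an assumption made for illustration: the names `complexity_score` and `select_sources`, the `SUBORDINATORS` word list, the weighting, and the `complex_share` parameter are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical list of clause-introducing words, used as a crude complexity cue.
SUBORDINATORS = {
    "although", "because", "which", "that", "while", "since",
    "if", "when", "whereas", "unless", "before", "after",
}

def complexity_score(sentence: str) -> float:
    """Toy proxy for complexity (an assumption, not the paper's metric).

    Combines sentence length scaled by lexical diversity with a bonus
    for each subordinating word or comma.
    """
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    if not tokens:
        return 0.0
    type_token_ratio = len(set(tokens)) / len(tokens)
    clause_markers = sum(t in SUBORDINATORS for t in tokens) + sentence.count(",")
    return len(tokens) * type_token_ratio + 2.0 * clause_markers

def select_sources(sentences, budget, complex_share=0.8):
    """Pick up to `budget` source sentences, mostly from the complex end.

    `complex_share` of the budget goes to the highest-scoring sentences;
    the remainder is filled from the lowest-scoring ones.
    """
    ranked = sorted(sentences, key=complexity_score, reverse=True)
    n_complex = min(len(ranked), round(budget * complex_share))
    n_simple = min(len(ranked) - n_complex, budget - n_complex)
    simple = ranked[len(ranked) - n_simple:] if n_simple else []
    return ranked[:n_complex] + simple

corpus = [
    "The cat sat.",
    "Although it was raining, the committee decided that the match "
    "would continue because the forecast improved.",
    "Dogs bark.",
    "She said that the report, which was late, needed revisions "
    "before the board could approve it.",
    "It rains.",
]

# Select three sentences to send for translation: two complex, one simple.
selected = select_sources(corpus, budget=3, complex_share=0.67)
```

In a real pipeline the selected English sentences would then go to human translators or a synthetic-generation step; only the selection stage is sketched here.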
