CL, Jan 26, 2021

Neural machine translation, corpus and frugality

arXiv:2101.10650v1, 0.73 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for more resource-efficient translation systems in academia and industry, offering an incremental alternative to large-scale models.

The paper argues for developing "frugal" bilingual neural machine translation systems, trained on relatively small corpora, alongside large-scale ones. Based on the observation of a standard professional human translator, it estimates upper bounds on corpus size: at most 75 million monolingual examples for the source language, 6 million monolingual examples for the target language, and 6 million aligned bilingual examples.

In the field of machine translation, in both academia and industry, there is growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on the observation of a standard professional human translator, we estimate that the corpora should consist of at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.
