CL, AI · Jun 6, 2021

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzman, Angela Fan

arXiv:2106.03193v1 · 34.9 · 816 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

FLORES-101 gives the machine translation community a high-quality benchmark and addresses a critical bottleneck in evaluating low-resource and multilingual models, though the contribution is incremental, since it builds on existing evaluation needs.

The authors address the lack of good evaluation benchmarks for low-resource and multilingual machine translation by introducing FLORES-101, a dataset of 3001 sentences translated into 101 languages by professional translators. It enables better assessment of model quality on low-resource languages and of many-to-many multilingual systems.

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
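The abstract's point about multilingual alignment can be illustrated with a minimal sketch. Because every language holds a translation of the same sentence at the same position, any ordered pair of languages yields an evaluation set without extra annotation. The toy corpus, language codes, and helper names (`direction_pairs`, `all_directions`, `num_directions`) below are hypothetical and only for illustration; they are not the FLORES-101 data or its release format.

```python
from itertools import permutations

# Toy stand-in for a multilingually aligned benchmark: index i in every
# list is a translation of the same source sentence. The language codes
# and sentences are illustrative placeholders, not FLORES-101 data.
aligned = {
    "eng": ["Hello.", "Thank you."],
    "fra": ["Bonjour.", "Merci."],
    "deu": ["Hallo.", "Danke."],
}


def direction_pairs(corpus, src, tgt):
    """Return (source, reference) sentence pairs for one direction."""
    return list(zip(corpus[src], corpus[tgt]))


def all_directions(corpus):
    """Return every ordered (src, tgt) pair of distinct languages."""
    return list(permutations(corpus, 2))


def num_directions(num_languages):
    """Count ordered pairs of distinct languages: n * (n - 1)."""
    return num_languages * (num_languages - 1)


directions = all_directions(aligned)
fra_to_deu = direction_pairs(aligned, "fra", "deu")
```

With three toy languages this gives six directions, including pairs such as French to German that never pass through English. Under the same counting, 101 aligned languages give 101 × 100 = 10,100 ordered translation directions.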
