CL · Feb 6, 2025

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem

arXiv:2502.04314v213.09 citations · h-index: 35 · EMNLP

Originality: Synthesis-oriented

AI Analysis

BOUQuET addresses the need for universal translation quality evaluation by providing a dataset that can be extended through crowd-sourcing to any written language. The contribution is incremental, however, because it builds on existing translation datasets.

The authors introduced BOUQuET, a multi-way, multicentric, multi-domain dataset and benchmark for translation quality evaluation, handcrafted in 8 non-English languages to serve as pivot languages for more accurate translations. They showed it has broader domain representation and simplifies translation tasks for non-experts compared to related datasets.

BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages is representative of the most widely spoken ones, and therefore they have the potential to serve as pivot languages that will enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is especially suitable for crowd-sourced extension, for which we are launching a call aiming at collecting a multi-way parallel corpus covering any written language.
