CL | May 29

Model-Based Quality Assessment for Massively Multilingual Parallel Data

Abdelaziz M. A. Ibrahim, Zihao Li, Jörg Tiedemann, Shaoxiong Ji

arXiv:2606.00285 | 72.4 | h-index: 7

AI Analysis

For researchers and practitioners working with multilingual parallel data, this work highlights the limitations of universal metrics and the need for direction-specific assessment strategies.

The paper addresses quality assessment for massively multilingual parallel data by decomposing it into two components: parallelism assessment and reference-free quality estimation. It finds that no single model is universally reliable across translation directions, and it suggests treating assessment as a direction-aware routing and calibration problem.

Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source-target directions. Results show that no model is universally reliable across translation directions. Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.
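
To make the contrast between a naive QE ensemble and direction-aware routing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a per-direction reliability table is available from held-out validation. The evaluator names (`qe_a`, `qe_b`, `qe_c`), the direction labels, and all numbers are hypothetical toy values chosen only to illustrate how averaging can dilute the signal of the one evaluator that works for a given direction.

```python
from statistics import mean

# Toy reliability table: evaluator -> direction -> agreement with a trusted
# reference on held-out data. Every name and number here is invented for
# illustration; none of them come from the paper.
RELIABILITY = {
    "qe_a": {("src", "tgt_high"): 0.60, ("src", "tgt_low"): 0.55},
    "qe_b": {("src", "tgt_high"): 0.65, ("src", "tgt_low"): 0.10},
    "qe_c": {("src", "tgt_high"): 0.62, ("src", "tgt_low"): 0.05},
}


def build_router(reliability):
    """Map each direction to the evaluator with the highest reliability on it."""
    directions = {d for per_dir in reliability.values() for d in per_dir}
    router = {}
    for d in directions:
        candidates = {
            name: per_dir[d]
            for name, per_dir in reliability.items()
            if d in per_dir
        }
        router[d] = max(candidates, key=candidates.get)
    return router


def naive_ensemble(scores):
    """Unweighted mean over all evaluator scores for one sentence pair."""
    return mean(scores.values())


def routed_score(direction, scores, router):
    """Use the routed evaluator's score; fall back to the mean if unrouted."""
    name = router.get(direction)
    if name is None or name not in scores:
        return naive_ensemble(scores)
    return scores[name]


router = build_router(RELIABILITY)

# One sentence pair scored by all three evaluators (toy values). Only qe_a
# is reliable for ("src", "tgt_low"), and it flags the pair as poor.
pair_scores = {"qe_a": 0.2, "qe_b": 0.9, "qe_c": 0.9}

print(naive_ensemble(pair_scores))                           # mean of all three
print(routed_score(("src", "tgt_low"), pair_scores, router))  # qe_a's score only
```

In this toy setup the unweighted mean is pulled up by the two evaluators that are unreliable for the low-coverage direction, while routing returns only the score of the evaluator that validated best for that direction. A calibration step (mapping each evaluator's raw scores onto a comparable scale per direction) would sit on top of this and is not shown.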

