CL | Jun 22, 2023

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger Schwenk

Meta AI

arXiv:2306.12907v126.3225 citations | h-index: 48 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficiently assessing bitext mining for low-resource languages. It offers a more accurate proxy that reduces the need to run expensive mining pipelines. The advance is incremental because it builds on the existing xSIM method.

The authors tackle the problem of evaluating bitext mining performance for low-resource languages by introducing xSIM++, an improved proxy score that uses rule-based synthetic examples to better mimic real mining scenarios. They validate it by mining bitexts for a set of low-resource languages and training NMT systems on the mined data, and show that xSIM++ correlates better with downstream BLEU scores than xSIM does, making it a reliable and cost-effective evaluation method.

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
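The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several assumptions: `embed` is a toy stand-in for a multilingual encoder (a bag of words that collapses every number to one token, imitating an encoder that is insensitive to numeric detail), plain cosine similarity replaces whatever scoring the paper uses, `perturb_numbers` is one hypothetical rule-based transformation, and a tie with a rival candidate is counted as an error. The source sentences are English stand-ins for non-English inputs.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy encoder (assumption): bag of words with all numbers collapsed to '#'."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    return Counter("#" if tok.isdigit() else tok for tok in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def perturb_numbers(sentence):
    """Hypothetical rule: add one to every number to create a hard negative."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), sentence)


def error_rate(sources, golds, candidates):
    """Share of sources whose gold target does not strictly beat every rival."""
    cand_vecs = [embed(c) for c in candidates]
    errors = 0
    for src, gold in zip(sources, golds):
        src_vec = embed(src)
        gold_score = cosine(src_vec, embed(gold))
        rival_score = max(
            cosine(src_vec, vec)
            for cand, vec in zip(candidates, cand_vecs)
            if cand != gold
        )
        if rival_score >= gold_score:  # ties count as errors (assumption)
            errors += 1
    return errors / len(sources)


# English stand-ins for source-language sentences and their gold targets.
sources = ["the train leaves at 9", "she bought 3 apples"]
golds = ["The train leaves at 9.", "She bought 3 apples."]

# xSIM-style setup: candidates are only the original English sentences.
base_err = error_rate(sources, golds, golds)

# xSIM++-style setup: add synthetic hard negatives to the candidate pool.
hard_negatives = [perturb_numbers(g) for g in golds]
hard_err = error_rate(sources, golds, golds + hard_negatives)

print(f"error without hard negatives: {base_err:.2f}")
print(f"error with hard negatives:    {hard_err:.2f}")
```

With the plain candidate pool, the toy encoder retrieves every gold target. Once the number-perturbed negatives are added, the weakness of the encoder becomes visible, which is the kind of signal the abstract attributes to the synthetic examples.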
