CL · Jun 22, 2023

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Meta AI
arXiv:2306.12907v1225 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently assessing bitext mining for low-resource languages. It offers a more accurate proxy that reduces the need to run expensive mining pipelines. The advance is incremental, since it builds on the existing xSIM method.

The authors address the problem of evaluating bitext mining performance for low-resource languages by introducing xSIM++, an improved proxy score that uses rule-based synthetic examples to better mimic real mining scenarios. They validate it through experiments showing that xSIM++ correlates better than xSIM with the downstream BLEU scores of systems trained on the mined data, making it a reliable and cost-effective evaluation method.

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
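The idea in the abstract can be illustrated with a minimal sketch. It assumes the proxy is a retrieval error rate: each source sentence retrieves its nearest English candidate in the embedding space, and a retrieval counts as an error when the nearest candidate is not the gold translation. Plain cosine similarity stands in for whatever similarity the actual implementation uses, the 2-dimensional vectors are toy values, and the function names and the `number_replacement` label are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_error_rate(src_emb, tgt_emb, gold, tgt_kind):
    """Return (error rate, Counter of wrongly retrieved candidate kinds).

    src_emb  : source-language sentence embeddings
    tgt_emb  : English candidate embeddings (originals plus any synthetic ones)
    gold     : gold[i] is the index in tgt_emb of the true translation of src i
    tgt_kind : label per candidate, e.g. "original" or a transformation name
    """
    errors_by_kind = Counter()
    for i, s in enumerate(src_emb):
        # Nearest candidate by cosine similarity (assumption: not margin-based).
        pred = max(range(len(tgt_emb)), key=lambda j: cosine(s, tgt_emb[j]))
        if pred != gold[i]:
            errors_by_kind[tgt_kind[pred]] += 1
    total_errors = sum(errors_by_kind.values())
    return total_errors / len(src_emb), errors_by_kind


# Toy data: three source sentences and their three gold English translations.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.8, 0.6], [0.0, 1.0], [1.0, 1.0]]
gold = [0, 1, 2]
kinds = ["original"] * 3

# Without synthetic candidates every retrieval is correct.
base_rate, base_errors = retrieval_error_rate(src, tgt, gold, kinds)

# Add one synthetic hard negative that sits close to source sentence 0.
tgt_aug = tgt + [[1.0, 0.1]]
kinds_aug = kinds + ["number_replacement"]  # hypothetical transformation label
aug_rate, aug_errors = retrieval_error_rate(src, tgt_aug, gold, kinds_aug)

print(base_rate, dict(base_errors))
print(aug_rate, dict(aug_errors))
```

In this toy setup the original candidate set is too easy to expose any weakness, while the augmented set surfaces an error and attributes it to the kind of synthetic example that caused it, which is the sort of fine-grained feedback the abstract describes.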

Code Implementations: 1 repo
