CL · May 1

Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

Daria Boratyn, Damian Brzyski, Albert Leśniak, Wojciech Łukasik, Maciej Rapacz, Jan Rybicki, Wojciech Słomczyński, Dariusz Stolicki

arXiv:2605.00618

AI Analysis

For researchers using multilingual text embeddings, this work provides a method to assess when machine translation can be trusted to preserve semantic relationships, addressing a practical bottleneck in cross-lingual NLP.

The paper investigates whether cosine similarity between paragraph embeddings is preserved under machine translation, using a multilingual political manifesto corpus. It finds that translation preserves semantic structure for ten languages but introduces detectable distortion for four.

We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
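The core idea in the abstract, comparing translation-induced change in pairwise similarities against the disagreement between embedding models on the original text, can be illustrated with a small sketch. This is not the authors' implementation. It assumes synthetic embeddings in place of real model outputs, uses one minus the Pearson correlation as a stand-in disagreement measure, and replaces the paper's non-inferiority test with a plain point comparison. The names `pairwise_cosine`, `disagreement`, and `verdict` are hypothetical.

```python
import numpy as np


def pairwise_cosine(emb):
    """Return the cosine similarity of every distinct pair of rows in emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(emb), k=1)  # each unordered pair once
    return sims[upper]


def disagreement(sims_a, sims_b):
    """Assumed measure: 1 - Pearson correlation of two similarity vectors."""
    return 1.0 - np.corrcoef(sims_a, sims_b)[0, 1]


def verdict(translation_gap, threshold):
    """Simplified two-way verdict; the paper also allows 'unresolved'."""
    return "invariant" if translation_gap <= threshold else "distorted"


# Synthetic stand-ins for real embeddings (assumption, not data from the paper).
rng = np.random.default_rng(0)
n_paragraphs, dim = 50, 16
model_a_original = rng.normal(size=(n_paragraphs, dim))
# A second model on the same original-language text: larger perturbation.
model_b_original = model_a_original + 0.3 * rng.normal(size=(n_paragraphs, dim))
# The first model on machine-translated text: smaller perturbation.
model_a_translated = model_a_original + 0.05 * rng.normal(size=(n_paragraphs, dim))

sims_a = pairwise_cosine(model_a_original)
# Inter-model disagreement on original text serves as the threshold.
threshold = disagreement(sims_a, pairwise_cosine(model_b_original))
# Disagreement introduced by translating before embedding.
gap = disagreement(sims_a, pairwise_cosine(model_a_translated))

print(f"threshold={threshold:.4f} gap={gap:.4f} verdict={verdict(gap, threshold)}")
```

Because the comparison happens on pairwise similarities rather than on the raw vectors, the two models need not share an embedding dimension. A faithful version would replace the point comparison with a per-language statistical test that can also return an unresolved verdict.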

