CL · Dec 29, 2025

Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

arXiv:2601.09732v1

Originality: Incremental advance

AI Analysis

This work provides a semantic benchmarking tool that helps practitioners select high-quality multilingual embedding models, addressing a gap in model-selection guidance.

The paper evaluates cross-lingual semantic alignment in multilingual embeddings. It introduces a new metric and framework that reveal a three-tier structure among models: top models reach SA scores around 0.70, a middle group plateaus between 0.55 and 0.61, and the weakest models fall below 0.50.

With hundreds of multilingual embedding models available, practitioners lack clear guidance on which ones provide genuine cross-lingual semantic alignment and which achieve task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring the ratio of inter-lingual to intra-lingual spread using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of scale, from 0.6B to 8B parameters; (3) MLM-only BERT models (mBERT, XLM-R; SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing that cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
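
The abstract describes SA only as a bounded ratio of inter-lingual to intra-lingual spread under cosine distance; it does not give the formula. The sketch below shows one way such a score could be computed, purely as an illustration. The function name `semantic_affinity_sketch`, the use of mean distances as "spread", and the mapping `intra / (intra + inter)` that squeezes the ratio into the range 0 to 1 are all assumptions, not the paper's definition.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v), clamped at 0 to absorb float error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


def semantic_affinity_sketch(emb_a, emb_b):
    """Hypothetical SA-like score; NOT the paper's formula.

    emb_a[i] and emb_b[i] are embeddings of the same concept in two
    languages. Assumes at least two concepts and non-zero vectors.
    """
    # Inter-lingual spread (assumed): mean distance between translation pairs.
    inter_dists = [cosine_distance(a, b) for a, b in zip(emb_a, emb_b)]
    inter = sum(inter_dists) / len(inter_dists)

    # Intra-lingual spread (assumed): mean pairwise distance between
    # different concepts within each language.
    intra_dists = [
        cosine_distance(x, y)
        for emb in (emb_a, emb_b)
        for x, y in combinations(emb, 2)
    ]
    intra = sum(intra_dists) / len(intra_dists)

    total = inter + intra
    if total == 0.0:
        raise ValueError("all embeddings coincide; score is undefined")

    # Assumed bounding: 1.0 when translations coincide, lower as
    # translation pairs drift apart relative to within-language spread.
    return intra / total


# Toy 2-D embeddings for two concepts in two languages.
lang_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]   # translations coincide
swapped_b = [[0.0, 1.0], [1.0, 0.0]]   # translations land on the wrong concept

print(semantic_affinity_sketch(lang_a, aligned_b))  # 1.0
print(semantic_affinity_sketch(lang_a, swapped_b))  # 0.5
```

Under this assumed form, a model whose translation pairs sit as far apart as unrelated concepts scores 0.5, which is at least consistent in spirit with the abstract treating SA below 0.50 as a failure of alignment.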
