CL · Sep 2, 2021

Similarity of Sentence Representations in Multilingual LMs: Resolving Conflicting Literature and Case Study of Baltic Languages

arXiv:2109.01207v4 · 5 citations · Has Code
AI Analysis

The paper clarifies a foundational issue in cross-lingual representation analysis for low-resource languages such as the Baltic languages, though the contribution is incremental, since it builds on prior contradictory findings.

This study resolves conflicting literature on whether multilingual language models project different languages into a shared cross-lingual space, showing that with specific pooling or similarity choices, languages do converge, and finds that Baltic languages belong to this shared space based on 378 pairwise comparisons.

Low-resource languages, such as the Baltic languages, benefit from large multilingual language models (LMs) that possess remarkable cross-lingual transfer capabilities. This work is an interpretation and analysis study of the cross-lingual representations of multilingual LMs. Previous works hypothesized that these LMs internally project representations of different languages into a shared cross-lingual space. However, the literature has produced contradictory results. In this paper, we revisit the prior work claiming that "BERT is not an Interlingua" and show that, with a different choice of pooling strategy or similarity index, different languages do converge to a shared space in such language models. Then, we perform cross-lingual representational analysis for the two most popular multilingual LMs employing 378 pairwise language comparisons. We discover that while most languages share a joint cross-lingual space, some do not. However, we observe that Baltic languages do belong to that shared space. The code is available at https://github.com/TartuNLP/xsim.
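
The abstract does not say which pooling strategies or similarity indices are compared. As a rough illustration only, the sketch below assumes two common pooling choices (first-token pooling and mask-aware mean pooling) and linear CKA as one possible similarity index. The function names are hypothetical and are not taken from the TartuNLP/xsim repository, and the random arrays stand in for real model outputs.

```python
import numpy as np


def cls_pool(token_embeddings):
    """First-token pooling: use the vector at position 0 of each sentence."""
    return token_embeddings[:, 0, :]


def mean_pool(token_embeddings, attention_mask):
    """Mean pooling: average token vectors, ignoring padded positions.

    token_embeddings: (sentences, tokens, hidden)
    attention_mask:   (sentences, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def linear_cka(x, y):
    """Linear CKA between two (sentences, hidden) representation matrices.

    Row i of x and row i of y are assumed to encode the same sentence,
    for example a sentence and its translation.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


if __name__ == "__main__":
    # Placeholder data: 100 parallel sentences, 12 tokens, hidden size 16.
    rng = np.random.default_rng(0)
    lang_a = rng.normal(size=(100, 12, 16))
    lang_b = lang_a + 0.1 * rng.normal(size=(100, 12, 16))
    mask = np.ones((100, 12), dtype=int)

    print("CLS pooling :", linear_cka(cls_pool(lang_a), cls_pool(lang_b)))
    print("mean pooling:", linear_cka(mean_pool(lang_a, mask), mean_pool(lang_b, mask)))
```

Under these assumptions, the same pair of languages can receive different similarity scores depending only on the pooling function, which is the kind of sensitivity the abstract points to. As a side note, 378 equals 28 × 27 / 2, the number of unordered pairs among 28 items, although the abstract itself does not state how many languages were compared.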

Code Implementations: 1 repo