CL · AI · Nov 11, 2022

English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings

arXiv:2211.06127v1 · 304 citations · h-index: 91 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of cross-lingual semantic alignment for NLP applications, offering a data-efficient solution that is particularly beneficial for low-resource languages, though it builds incrementally on existing contrastive learning methods.

The paper tackles the problem of learning universal cross-lingual sentence embeddings without supervised parallel data. It extends SimCSE to multilingual settings (mSimCSE) and shows that contrastive learning on English data alone can yield high-quality embeddings. In the unsupervised setting, performance is comparable to fully supervised methods on retrieval for low-resource languages and on multilingual STS.
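As a rough illustration of the cross-lingual retrieval task mentioned above, the sketch below scores retrieval by cosine nearest neighbour. It is not the paper's evaluation code: the function names `retrieve` and `accuracy_at_1` are hypothetical, the toy vectors stand in for real sentence embeddings, and it assumes `queries[i]` and `candidates[i]` are translations of each other.

```python
import numpy as np

def retrieve(queries, candidates):
    """Return, for each query embedding, the index of its nearest candidate by cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)

def accuracy_at_1(queries, candidates):
    """Fraction of queries whose nearest candidate is the aligned one (assumes row i pairs with row i)."""
    predictions = retrieve(queries, candidates)
    return float(np.mean(predictions == np.arange(len(queries))))

# Toy stand-ins for embeddings of four sentences in two languages.
source = np.eye(4)
target = 2.0 * np.eye(4) + 0.1  # same directions, slightly perturbed
score = accuracy_at_1(source, target)  # 1.0 for this toy data
```

With well-aligned embeddings, each sentence's translation is its nearest neighbour, so the score approaches 1.0. Misaligned languages push it toward chance.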

Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves on previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
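The contrastive objective that SimCSE-style methods rely on can be sketched as an in-batch InfoNCE loss. This is a minimal NumPy illustration, not the authors' implementation: the function name `info_nce_loss` is hypothetical, the temperature of 0.05 is the SimCSE default and is only assumed to carry over to mSimCSE, and random vectors stand in for encoder outputs.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `positives` is the positive for row i of `anchors`."""
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    # logits[i, j] = cos(anchor_i, positive_j) / temperature
    logits = a @ p.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal; the rest of each row acts as negatives.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))  # stand-in sentence embeddings
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
shuffled = info_nce_loss(z, z[::-1])  # every positive deliberately mismatched
```

The loss is low when each anchor is closest to its own positive and high when the pairs are mismatched. In unsupervised SimCSE the two views of a sentence come from two dropout-perturbed encoder passes, which the small additive noise here only imitates.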

Code Implementations: 1 repo
