CLApr 9, 2020

On the Language Neutrality of Pre-trained Multilingual Representations

arXiv:2004.05160v41016 citations
Originality Incremental advance
AI Analysis

This addresses the problem of cross-lingual representation quality for multilingual NLP tasks, offering incremental improvements in language neutrality.

The paper investigates the language-neutrality of multilingual contextual embeddings, finding they are more language-neutral and informative than aligned static embeddings but only moderately so by default, and proposes two methods to improve neutrality while achieving state-of-the-art accuracy on language identification and matching statistical methods for word alignment without parallel data.

Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multi-lingual tasks. Previous work probed the cross-linguality of the representations indirectly using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings, which are explicitly trained for language neutrality. Contextual embeddings are still only moderately language-neutral by default, so we propose two simple methods for achieving stronger language neutrality: first, by unsupervised centering of the representation for each language and second, by fitting an explicit projection on small parallel data. Besides, we show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences without using parallel data.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes