How Language-Neutral is Multilingual BERT?
This work addresses the challenge of building better language-neutral representations for multilingual tasks, particularly tasks that depend on semantics. It is incremental in that it builds on prior probing studies.
The authors assess the language neutrality of multilingual BERT (mBERT) by analyzing its semantic properties. They show that mBERT representations can be split into a language-specific component and a language-neutral component. The language-neutral component supports high-accuracy word alignment and sentence retrieval, but it is not yet good enough for MT quality estimation.
Multilingual BERT (mBERT) provides sentence representations for 104 languages, which are useful for many multilingual tasks. Previous work probed the cross-linguality of mBERT using zero-shot transfer learning on morphological and syntactic tasks. We instead focus on the semantic properties of mBERT. We show that mBERT representations can be split into a language-specific component and a language-neutral component, and that the language-neutral component models semantics well enough to allow high-accuracy word alignment and sentence retrieval, but is not yet good enough for the more difficult task of MT quality estimation. Our work presents interesting challenges that must be solved to build better language-neutral representations, particularly for tasks requiring linguistic transfer of semantics.
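
The abstract does not say how the representations are split, so the following is only a minimal sketch under a stated assumption: the language-specific component is approximated by the mean vector (centroid) of each language's sentence embeddings, and the language-neutral component is what remains after subtracting it. The function names `center_by_language` and `retrieve` are hypothetical, and the embeddings are synthetic NumPy vectors rather than real mBERT outputs. Retrieval here is plain nearest-neighbor search by cosine similarity.

```python
import numpy as np


def center_by_language(embeddings, languages):
    """Subtract each language's centroid from its sentence embeddings.

    Assumption (not stated in the abstract): the centroid stands in for the
    language-specific component; the residual is treated as language-neutral.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    languages = np.asarray(languages)
    centered = np.empty_like(embeddings)
    centroids = {}
    for lang in np.unique(languages):
        mask = languages == lang
        centroids[str(lang)] = embeddings[mask].mean(axis=0)
        centered[mask] = embeddings[mask] - centroids[str(lang)]
    return centered, centroids


def retrieve(queries, candidates):
    """For each query row, return the index of the most cosine-similar candidate."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)


# Synthetic stand-in for parallel sentences: a shared "meaning" vector plus a
# large per-language offset and a little noise. Not real mBERT data.
rng = np.random.default_rng(0)
n, dim = 50, 16
meaning = rng.normal(size=(n, dim))
en = meaning + 5.0 * rng.normal(size=dim) + 0.1 * rng.normal(size=(n, dim))
de = meaning + 5.0 * rng.normal(size=dim) + 0.1 * rng.normal(size=(n, dim))

embeddings = np.vstack([en, de])
languages = ["en"] * n + ["de"] * n
centered, centroids = center_by_language(embeddings, languages)
en_c, de_c = centered[:n], centered[n:]

gold = np.arange(n)  # sentence i in "en" is parallel to sentence i in "de"
acc_raw = float((retrieve(en, de) == gold).mean())
acc_centered = float((retrieve(en_c, de_c) == gold).mean())
print(f"retrieval accuracy, raw: {acc_raw:.2f}, centered: {acc_centered:.2f}")
```

With real data, `en` and `de` would be replaced by sentence vectors extracted from mBERT for a parallel corpus; how those vectors are pooled is a separate choice that this sketch leaves open.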