On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning
This work examines the reliability of cross-lingual embeddings for NLP applications and reveals limitations in current state-of-the-art systems. The contribution is incremental in that it focuses on evaluation rather than proposing new methods.
The paper evaluates the robustness of unsupervised and semi-supervised cross-lingual word embedding methods. Through extensive analysis across variables such as language pairs and training corpora, it finds that high-quality embeddings cannot always be learned without much supervision.
Cross-lingual word embeddings are vector representations of words in different languages in which words with similar meanings are represented by similar vectors, regardless of the language. Recent methods that construct these embeddings by aligning monolingual spaces have shown that accurate alignments can be obtained with little or no supervision. However, the focus has been on a particular controlled evaluation scenario, and there is no strong evidence of how current state-of-the-art systems would fare with noisy text or with language pairs that have major linguistic differences. In this paper, we present an extensive evaluation of multiple cross-lingual embedding models, analyzing their strengths and limitations with respect to variables such as target language, training corpora and amount of supervision. Our conclusions cast doubt on the view that high-quality cross-lingual embeddings can always be learned without much supervision.
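
To make the idea of "aligning monolingual spaces" concrete, the following is a minimal sketch of one common alignment technique, orthogonal Procrustes, applied to toy data. The abstract does not say which alignment methods the paper evaluates, so the choice of technique here is an assumption made for illustration. The function name `procrustes_align` and the synthetic matrices `X` and `Y` are hypothetical as well. The rows of `X` and `Y` stand in for source-language and target-language vectors of word pairs from a seed dictionary.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X, Y: (n, d) arrays whose i-th rows are the embeddings of a
    translation pair (hypothetical seed dictionary).
    """
    # Closed-form solution: W = U @ Vt, where U, S, Vt = SVD(X^T Y).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy data (an assumption for illustration): the "source" space is an
# exact rotation of the "target" space, so a perfect alignment exists.
rng = np.random.default_rng(0)
n, d = 20, 5
Y = rng.normal(size=(n, d))                    # toy target-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
X = Y @ Q.T                                    # toy source-language vectors

W = procrustes_align(X, Y)
aligned = X @ W                                # source vectors mapped into target space
```

In this idealized setting the mapped source vectors coincide with their target counterparts. Real monolingual spaces are not exact rotations of one another, which is one reason that alignment quality can vary with language pair, corpus and amount of supervision.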