CL · Nov 1, 2018

Multilingual Embeddings Jointly Induced from Contexts and Concepts: Simple, Strong and Scalable

Philipp Dufter, Mengjie Zhao, Hinrich Schütze

arXiv:1811.00586v20.52 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of creating effective multilingual embeddings for NLP applications. It is an incremental advance because it builds on and combines two existing approaches: the context-based S-ID learner and concept-based embedding induction.

The paper tackles the problem of learning multilingual word embeddings by combining context-based and concept-based learning in a single method, Co+Co. Among the methods applicable to both the low-resource Parallel Bible Corpus and the high-resource EuroParl corpus, Co+Co performs best in the authors' six-task evaluation setup.

Word embeddings induced from local context are prevalent in NLP. A simple and effective context-based multilingual embedding learner is Levy et al. (2017)'s S-ID (sentence ID) method. Another line of work induces high-performing multilingual embeddings from concepts (Dufter et al., 2018). In this paper, we propose Co+Co, a simple and scalable method that combines context-based and concept-based learning. From a sentence-aligned corpus, concepts are extracted via sampling; words are then associated with their concept ID and sentence ID in embedding learning. This is the first work that successfully combines context-based and concept-based embedding learning. We show that Co+Co performs well for two different application scenarios: the Parallel Bible Corpus (1000+ languages, low-resource) and EuroParl (12 languages, high-resource). Among methods applicable to both corpora, Co+Co performs best in our evaluation setup of six tasks.
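
To make the idea of associating words with sentence IDs and concept IDs more concrete, here is a minimal sketch of how such training pairs might be assembled. It is not the authors' implementation. The function name `build_pairs`, the `S:`/`C:` ID prefixes, the language-prefixed word format, and the toy data are all assumptions made for illustration. The concept-extraction step (sampling from the sentence-aligned corpus) is not implemented here; concepts are passed in as given, since the abstract does not detail the procedure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data shapes (assumptions, not taken from the paper):
#   corpus:   sentence ID -> language code -> list of tokens
#   concepts: concept ID  -> set of language-prefixed words
Corpus = Dict[str, Dict[str, List[str]]]
Concepts = Dict[str, Set[str]]


def build_pairs(corpus: Corpus, concepts: Concepts) -> List[Tuple[str, str]]:
    """Return (word, ID) pairs for a generic embedding learner.

    Each word is paired with the ID of every sentence it occurs in
    (context signal) and with the ID of every concept it belongs to
    (concept signal). How the two signals are weighted or mixed during
    training is left open in this sketch.
    """
    pairs: List[Tuple[str, str]] = []

    # Context signal: word paired with its sentence ID, across all languages.
    for sentence_id, versions in corpus.items():
        for lang, tokens in versions.items():
            for token in tokens:
                pairs.append((f"{lang}:{token}", f"S:{sentence_id}"))

    # Concept signal: word paired with its concept ID.
    # Sorting only makes the output order deterministic.
    for concept_id, members in concepts.items():
        for word in sorted(members):
            pairs.append((word, f"C:{concept_id}"))

    return pairs


# Toy example: one aligned sentence in two languages, one given concept.
corpus = {"v1": {"en": ["in", "the", "beginning"], "de": ["am", "anfang"]}}
concepts = {"c0": {"en:beginning", "de:anfang"}}
pairs = build_pairs(corpus, concepts)
for word, context_id in pairs:
    print(word, context_id)
```

Pairs of this shape could then be fed to any learner that accepts arbitrary (word, context) pairs, so that words sharing sentence IDs or concept IDs across languages end up close together in the embedding space.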
