BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
BilBOWA enables scalable bilingual representation learning for natural language processing applications by reducing reliance on costly word-aligned data.
The paper tackles the problem of learning bilingual word representations without requiring word-aligned parallel data, achieving state-of-the-art performance on cross-lingual document classification and lexical translation tasks using WMT11 data.
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
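The bag-of-words cross-lingual term mentioned above can be illustrated with a small sketch. It assumes, as one plausible reading of the abstract, that the term penalizes the squared Euclidean distance between the mean word vectors of the two sides of a sentence-aligned pair, so no word-level alignment is needed. The function names, the toy vocabularies, and the two-dimensional embeddings are hypothetical, and the two monolingual noise-contrastive losses that this term would regularize are omitted.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(words: List[str], embeddings: Dict[str, Vector]) -> Vector:
    """Average the embeddings of the words in one sentence (bag of words)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [t / len(words) for t in total]


def bow_alignment_loss(
    src_sentence: List[str],
    tgt_sentence: List[str],
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
) -> float:
    """Squared L2 distance between the mean vectors of an aligned sentence pair.

    Assumption: this stands in for the cross-lingual objective; it uses only
    sentence-level alignment, never word-level alignment.
    """
    src_mean = mean_vector(src_sentence, src_emb)
    tgt_mean = mean_vector(tgt_sentence, tgt_emb)
    return sum((a - b) ** 2 for a, b in zip(src_mean, tgt_mean))


# Toy embeddings (hypothetical values chosen for easy arithmetic).
EN = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
FR = {"le": [1.0, 0.0], "chat": [0.0, 1.0], "chien": [0.0, 3.0]}

matched = bow_alignment_loss(["the", "cat"], ["le", "chat"], EN, FR)
mismatched = bow_alignment_loss(["the", "cat"], ["le", "chien"], EN, FR)
print(matched, mismatched)
```

In this toy setting the pair whose sentence means coincide incurs zero loss, while the mismatched pair is penalized. In a full model, a term like this would be added to the two monolingual losses so that gradient updates pull the two embedding spaces together.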