Learning Joint Multilingual Sentence Representations with Neural Machine Translation
This work addresses cross-lingual semantic understanding for multilingual NLP applications, but is incremental as it builds on existing neural machine translation frameworks.
The paper tackles the problem of learning language-independent sentence representations with neural machine translation across six languages, and finds that sentences close in embedding space are semantically related despite differences in structure and syntax.
In this paper, we use the framework of neural machine translation to learn joint sentence representations across six very different languages. Our premise is that a representation that is independent of the language is likely to capture the underlying semantics. We define a new cross-lingual similarity measure, compare up to 1.4M sentence representations, and study the characteristics of close sentences. We provide experimental evidence that sentences that are close in embedding space are indeed semantically highly related, but often have quite different structure and syntax. These relations also hold when comparing sentences in different languages.
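To make the idea of "close sentences in embedding space" concrete, the following is a minimal sketch of a nearest-neighbor lookup over sentence vectors. It rests on assumptions: the abstract does not specify the new cross-lingual similarity measure, so plain cosine similarity is used here as a stand-in, and the three-dimensional vectors, the sentence labels, and the helper names `cosine_similarity` and `nearest_neighbors` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, candidates, k=1):
    """Return the k (similarity, label) pairs whose vectors are most similar to query."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy, hand-made vectors (NOT produced by any translation model).
embeddings = {
    "en: The cat sleeps.": [0.9, 0.1, 0.0],
    "fr: Le chat dort.": [0.8, 0.2, 0.1],
    "en: Stocks fell sharply.": [0.0, 0.1, 0.9],
}

query_label = "en: The cat sleeps."
others = {label: vec for label, vec in embeddings.items() if label != query_label}
best = nearest_neighbors(embeddings[query_label], others, k=1)
print(best[0][1])
```

In a joint multilingual space, the nearest neighbor of a sentence may come from a different language, which is the kind of relation the abstract reports. A brute-force scan like this one is linear in the number of candidates per query, so comparing on the order of 1.4M representations would in practice call for vectorized or approximate search.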