CL, LG, Feb 19, 2018

Learning Word Vectors for 157 Languages

arXiv:1802.06893v2, 1578 citations
AI Analysis

This paper provides pre-trained word vectors for 157 languages, which aids natural language processing tasks, but the contribution is incremental: it extends existing methods to new data.

The authors trained high-quality word vector representations for 157 languages using Wikipedia and Common Crawl data, and introduced new evaluation datasets for French, Hindi, and Polish, showing strong performance compared to previous models on 10 languages.

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
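To make the word analogy evaluation mentioned in the abstract concrete, here is a minimal sketch of how such a benchmark is commonly scored: for a question "a is to b as c is to ?", take the word whose vector is closest (by cosine similarity) to b - a + c, excluding the three query words. The abstract does not state the exact scoring rule used, so this vector-offset rule is an assumption, and the toy vectors, `solve_analogy`, and `analogy_accuracy` are hypothetical names made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset rule (assumed)."""
    # Target point in the embedding space: b - a + c.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)


# Hypothetical 3-dimensional toy vectors, not real pre-trained embeddings.
TOY_VECTORS = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "warsaw": [0.0, 1.0, 1.0],
    "poland": [0.0, 1.0, 0.0],
    "delhi": [0.5, 0.5, 1.0],
}

# Hypothetical analogy questions in (a, b, c, expected) form.
TOY_QUESTIONS = [
    ("paris", "france", "warsaw", "poland"),
    ("france", "paris", "poland", "warsaw"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_VECTORS, TOY_QUESTIONS))
```

With real pre-trained vectors, the dictionary would hold one high-dimensional vector per vocabulary word, and the questions would come from an analogy dataset for the language being evaluated.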

Code Implementations: 2 repos
