CL, LG · Jul 5, 2013

Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou, Bryan Perozzi, Steven Skiena

arXiv:1307.1662v2 · 494 citations

Originality: Synthesis-oriented

AI Analysis

This work publicly releases multilingual word embeddings to help researchers build multilingual NLP applications. It is incremental in that it applies existing methods to new data.

The authors trained word embeddings for over 100 languages using Wikipedia data and showed they achieve competitive performance as features for part-of-speech tagging in English, Danish, and Swedish, with results near state-of-the-art.

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-the-art methods in English, Danish, and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
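The "proximity of word groupings" mentioned in the abstract can be illustrated with a nearest-neighbor lookup over word vectors. The following is a minimal sketch, not the authors' code. The vectors in `TOY_EMBEDDINGS` are invented for illustration and are not taken from the released embeddings. The choice of cosine similarity is also an assumption, since the abstract does not say which proximity measure the paper uses. The function names are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT values from the released Polyglot embeddings.
TOY_EMBEDDINGS = {
    "paris": [0.9, 0.1, 0.0],
    "london": [0.8, 0.2, 0.1],
    "berlin": [0.85, 0.15, 0.05],
    "apple": [0.1, 0.9, 0.2],
    "banana": [0.0, 0.8, 0.3],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=2):
    """Return the k words most similar to `word`, best match first.

    Assumption: proximity is measured by cosine similarity.
    """
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for neighbor, score in nearest_neighbors("paris", TOY_EMBEDDINGS):
        print(f"{neighbor}\t{score:.3f}")
```

With these toy vectors, the city names cluster together and the fruit names cluster together, which is the kind of grouping the abstract describes inspecting. With real embeddings, the same lookup would run over a vocabulary of many thousands of words per language.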

