CL, Aug 2, 2016

New word analogy corpus for exploring embeddings of Czech words

arXiv:1608.00789v15.321 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work fills a gap for NLP researchers working with Czech, but the contribution is incremental: it applies existing methods to a new language-specific dataset.

The authors address the lack of word analogy resources for Czech by introducing a new corpus that probes syntactic, morphosyntactic, and semantic properties. They evaluate the Word2Vec and GloVe algorithms on it and make the corpus available for research.

Word embedding methods have proven very useful in many NLP (Natural Language Processing) tasks. Embeddings of English words and phrases have been studied extensively, but little attention has been paid to other languages. Our goal in this paper is to explore the behavior of state-of-the-art word embedding methods on Czech, a language characterized by very rich morphology. We introduce a new corpus for the word analogy task that inspects syntactic, morphosyntactic, and semantic properties of Czech words and phrases. We experiment with the Word2Vec and GloVe algorithms and discuss their results on this corpus. The corpus is available to the research community.
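For readers unfamiliar with the word analogy task mentioned in the abstract, the following is a minimal sketch of how such a task is commonly scored: given a question "a is to b as c is to ?", the answer is taken to be the word whose vector is closest (by cosine similarity) to b - a + c. This is an illustration under stated assumptions, not the paper's evaluation code. The function names, the three-dimensional toy vectors, and the sample questions are all hypothetical, and the abstract does not specify the exact scoring protocol the authors used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The three query words are excluded from the candidates, which is a
    common convention in analogy evaluation (assumed here).
    """
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(embeddings, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(1 for a, b, c, expected in questions
                  if solve_analogy(embeddings, a, b, c) == expected)
    return correct / len(questions)


# Hypothetical toy vectors (dimensions: male, female, royal); real
# embeddings would come from a trained Word2Vec or GloVe model.
TOY_EMBEDDINGS = {
    "muž": [1.0, 0.0, 0.0],        # man
    "žena": [0.0, 1.0, 0.0],       # woman
    "král": [1.0, 0.0, 1.0],       # king
    "královna": [0.0, 1.0, 1.0],   # queen
    "pes": [0.5, 0.5, 0.0],        # dog (distractor)
}

# Hypothetical questions, not taken from the paper's corpus.
TOY_QUESTIONS = [
    ("muž", "žena", "král", "královna"),
    ("král", "královna", "muž", "žena"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_EMBEDDINGS, "muž", "žena", "král"))
    print(analogy_accuracy(TOY_EMBEDDINGS, TOY_QUESTIONS))
```

With trained embeddings, the same accuracy would typically be reported per category (for example syntactic versus semantic questions), which is how an analogy corpus lets one compare models such as Word2Vec and GloVe.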
