CL, Jan 19, 2018

Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

arXiv:1801.06407v1, 11 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses corpus selection for Russian word embeddings and offers useful guidance for NLP researchers, but it is incremental: it builds on known comparisons between web and curated corpora.

The researchers compared word embedding models trained on two large Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web) and the Russian National Corpus (RNC). Evaluated on the Russian part of the Multilingual SimLex999 dataset, the RNC model generally performed better and was more robust on the semantic similarity task. The authors also describe fine-grained differences in model behavior and identify issues in the dataset itself.

In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to a model trained on the Russian National Corpus (RNC). The two corpora differ greatly in size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new corrected version. Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, the learning curves for both models are described, showing that the RNC is generally more robust as training material for this task.
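
The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the common convention for SimLex-style datasets: score each word pair by the cosine similarity of its vectors, then report the Spearman rank correlation against the human judgments. The abstract does not state the exact metric, so treat that as an assumption. The function names (`cosine`, `ranks`, `spearman`, `evaluate`), the toy vectors, and the toy scores are all hypothetical and are not taken from the paper or its dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, pairs):
    """Score a model on (word1, word2, human_score) triples.

    Pairs containing an out-of-vocabulary word are skipped and counted.
    """
    model_scores, human_scores, skipped = [], [], 0
    for w1, w2, gold in pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1
            continue
        model_scores.append(cosine(vectors[w1], vectors[w2]))
        human_scores.append(gold)
    return spearman(model_scores, human_scores), skipped


# Toy data (hypothetical): 2-dimensional vectors and made-up human scores.
vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
pairs = [
    ("cat", "dog", 8.0),
    ("car", "truck", 7.0),
    ("dog", "truck", 2.0),
    ("cat", "car", 1.0),
    ("cat", "plane", 3.0),  # "plane" is out of vocabulary and gets skipped
]

rho, skipped = evaluate(vectors, pairs)
print(f"Spearman rho = {rho:.3f}, skipped pairs = {skipped}")
```

Reporting the number of skipped pairs alongside the correlation matters when comparing models trained on corpora of very different size, since vocabulary coverage of the evaluation set may differ between them.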

Code Implementations: 1 repo
