CL · May 12, 2018

Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation

arXiv:1805.04685v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of training data for supervised WSD in multiple languages, which particularly benefits low-resourced languages. It is incremental, however, because it builds on the existing WordNet and on existing annotation methods.

The authors tackled the problem of multilingual Word Sense Disambiguation by releasing six large-scale sense-annotated datasets covering millions of sentences, which experiments show surpass state-of-the-art results for low-resourced languages and provide competitive results for English.

We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org.
