CL · May 12, 2018

Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation

Tommaso Pasini, Francesco Maria Elia, Roberto Navigli

arXiv:1805.04685v1 · 0.84 citations

Originality: Incremental advance

AI Analysis

This work addresses the data scarcity problem for supervised WSD across multiple languages, particularly benefiting low-resourced languages. It is incremental, however, in that it builds on existing WordNet and annotation methods.

The authors tackle multilingual Word Sense Disambiguation by releasing six large-scale sense-annotated datasets covering millions of sentences. Experiments show that supervised systems trained on these datasets surpass the state of the art for low-resourced languages and achieve competitive results for English.

We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org.
