CL, Feb 13, 2018

A Short Survey on Sense-Annotated Corpora

arXiv:1802.04744v4999 citations
AI Analysis

This is an incremental survey that helps researchers in natural language processing by summarizing the datasets available for addressing the knowledge-acquisition bottleneck in sense annotation.

The paper surveys existing sense-annotated corpora used for Word Sense Disambiguation, providing statistics and analysis of datasets across languages and lexical resources like WordNet and BabelNet.

Large sense-annotated datasets are increasingly necessary for training deep supervised systems in Word Sense Disambiguation. However, gathering high-quality sense-annotated data for as many instances as possible is a laborious and expensive task. This has led to the proliferation of automatic and semi-automatic methods for overcoming the so-called knowledge-acquisition bottleneck. In this short survey we present an overview of sense-annotated corpora, annotated either manually or (semi-)automatically, that are currently available for different languages and that use distinct lexical resources as sense inventories, namely WordNet, Wikipedia, and BabelNet. Furthermore, we provide the reader with general statistics for each dataset and an analysis of their specific features.

