CL · IT · ML · Mar 7, 2015

Identifying missing dictionary entries with frequency-conserving context models

arXiv:1503.02120v3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of expanding lexicons for natural language processing and dictionary curation, though it appears incremental as it builds on a previously developed framework.

The researchers tackled the problem of identifying missing phrase entries in dictionaries by training a frequency-conserving context model on Wiktionary data, resulting in highly effective filters that propose short lists of potential missing entries for editorial review.

In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary (an extensive, online, collaborative, and open-source dictionary that contains over 100,000 phrasal definitions), we develop highly effective filters for the identification of meaningful, missing phrase entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough lexical extraction technique and expanding our knowledge of the defined English lexicon of phrases.
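As a rough illustration of what a frequency-conserving context model and a dictionary-based filter might look like, the sketch below makes several assumptions that are not stated in the abstract: a context is formed by replacing one word of a phrase with a wildcard `*`, each phrase splits its frequency equally among its contexts, and an undefined phrase is scored by how much of its contexts' mass comes from phrases already in the dictionary. The function names (`contexts`, `context_mass`, `rank_missing`) and the toy counts are hypothetical, and this is not presented as the authors' implementation.

```python
from collections import defaultdict


def contexts(phrase):
    """Return the wildcard contexts of a non-empty phrase (assumed definition).

    Each context replaces exactly one word with "*".
    """
    words = phrase.split()
    return [" ".join(words[:i] + ["*"] + words[i + 1:]) for i in range(len(words))]


def context_mass(phrase_freqs, keep=None):
    """Spread each phrase's frequency equally over its contexts.

    Because every phrase hands out exactly its own frequency, the total
    mass over contexts equals the total phrase frequency (conservation).
    If `keep` is given, only phrases in that set contribute.
    """
    mass = defaultdict(float)
    for phrase, freq in phrase_freqs.items():
        if keep is not None and phrase not in keep:
            continue
        ctxs = contexts(phrase)
        for c in ctxs:
            mass[c] += freq / len(ctxs)
    return mass


def rank_missing(phrase_freqs, dictionary):
    """Rank undefined phrases by the defined share of their contexts' mass.

    Assumes all frequencies are positive, so every context has nonzero mass.
    """
    total = context_mass(phrase_freqs)
    defined = context_mass(phrase_freqs, keep=dictionary)
    scores = {}
    for phrase in phrase_freqs:
        if phrase in dictionary:
            continue
        ctxs = contexts(phrase)
        scores[phrase] = sum(defined[c] / total[c] for c in ctxs) / len(ctxs)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up phrase counts; "in the end" plays the role of a defined entry.
phrase_freqs = {"in the end": 6.0, "in the beginning": 3.0, "at the end": 3.0}
ranked = rank_missing(phrase_freqs, dictionary={"in the end"})
```

In this toy setup, phrases that share contexts with defined entries rise to the top of `ranked`, which is the kind of short list one could hand to editors for review.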

