CL | Oct 17, 2019

SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble

arXiv:1910.08192v1 | 93 citations
Originality: Incremental advance
AI Analysis

This work addresses a critical task in knowledge discovery, with applications such as information extraction and web search. It is rated incremental because it builds on prior distributional-similarity and bootstrapping approaches, adding context feature selection and a rank ensemble.

The paper tackles corpus-based set expansion: finding all entities of a semantic class in a corpus from a few seed examples. It proposes SetExpan to handle the noisy context features that cause entity intrusion and semantic drifting, and reports improved mean average precision over previous methods on three datasets.

Corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds) is a critical task in knowledge discovery. It may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search. To discover new entities in an expanded set, previous approaches either make one-time entity ranking based on distributional similarity, or resort to iterative pattern-based bootstrapping. The core challenge for these methods is how to deal with noisy context features derived from free-text corpora, which may lead to entity intrusion and semantic drifting. In this study, we propose a novel framework, SetExpan, which tackles this problem, with two techniques: (1) a context feature selection method that selects clean context features for calculating entity-entity distributional similarity, and (2) a ranking-based unsupervised ensemble method for expanding entity set based on denoised context features. Experiments on three datasets show that SetExpan is robust and outperforms previous state-of-the-art methods in terms of mean average precision.
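
The two techniques named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring functions, so everything concrete here is an assumption made for illustration: features are scored by their summed weight over the seeds, similarity is a weighted Jaccard over the selected features, the ensemble samples random feature subsets, and rankings are combined by mean reciprocal rank. The names `select_features`, `weighted_jaccard`, `expand`, and the toy `weights` table are hypothetical and are not taken from the authors' code.

```python
import random
from collections import defaultdict


def select_features(weights, seeds, top_q):
    """Keep the top_q context features with the largest total weight over the seeds.

    Assumption: 'clean' features are those strongly shared by the seed entities.
    """
    score = defaultdict(float)
    for s in seeds:
        for feat, w in weights.get(s, {}).items():
            score[feat] += w
    ranked = sorted(score, key=lambda f: (-score[f], f))
    return ranked[:top_q]


def weighted_jaccard(a, b, feats):
    """Weighted Jaccard similarity of two entities, restricted to feats."""
    num = sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in feats)
    den = sum(max(a.get(f, 0.0), b.get(f, 0.0)) for f in feats)
    return num / den if den else 0.0


def expand(weights, seeds, top_q=4, n_rounds=20, keep=0.75, rng_seed=0):
    """Rank non-seed entities by an ensemble over random feature subsets."""
    rng = random.Random(rng_seed)
    feats = select_features(weights, seeds, top_q)
    candidates = sorted(e for e in weights if e not in seeds)
    if not feats or not candidates:
        return candidates
    k = max(1, int(len(feats) * keep))
    mrr = defaultdict(float)
    for _ in range(n_rounds):
        subset = rng.sample(feats, k)  # one "view" of the selected features

        def sim(e):
            # Average similarity of candidate e to all seeds under this view.
            total = sum(weighted_jaccard(weights[e], weights[s], subset) for s in seeds)
            return total / len(seeds)

        ranking = sorted(candidates, key=lambda e: (-sim(e), e))
        for rank, e in enumerate(ranking, start=1):
            mrr[e] += 1.0 / rank / n_rounds  # accumulate mean reciprocal rank
    return sorted(candidates, key=lambda e: (-mrr[e], e))


# Toy entity -> {context feature: weight} table (made-up numbers).
weights = {
    "Texas": {"state of __": 3, "__ governor": 2, "visited __": 1},
    "Ohio": {"state of __": 2, "__ governor": 3, "visited __": 1},
    "Florida": {"state of __": 3, "__ governor": 2, "visited __": 2},
    "Paris": {"visited __": 3, "city of __": 3},
    "Austin": {"city of __": 2, "visited __": 1},
}
seeds = ["Texas", "Ohio"]
ranking = expand(weights, seeds, top_q=2)
print(ranking)
```

In this toy table the generic feature "visited __" is shared by states and cities alike, so dropping it during feature selection is what keeps city names from intruding on the expanded set. Combining ranks rather than raw scores is one plausible way to make the ensemble insensitive to the scale of any single feature subset.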

Code Implementations: 1 repo
