AI IR Apr 12, 2018

CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

arXiv:1804.04635v1 15.769 citations

Originality: Incremental advance

AI Analysis

The method enables scalable knowledge base population from diverse web sources, though the advance is incremental because it builds on existing distant supervision methods.

The paper tackles the problem of extracting relations from semi-structured webpages without manual annotations by using distant supervision to automatically generate training labels from an existing knowledge base, achieving a precision of 90% and harvesting 1.25 million facts from over 400,000 pages.

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.
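To make the label-generation step concrete, here is a minimal sketch of distant supervision over a DOM tree. It is not the authors' implementation: the names `TextNodeCollector` and `distant_labels`, the exact-string matching rule, and the assumption that the page's topic entity is already known are all simplifications introduced for illustration. The sketch only aligns knowledge-base objects with text nodes; training a classifier on the resulting labels is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def distant_labels(html, kb, topic):
    """Label text nodes whose text equals the object of a KB triple about `topic`.

    Assumption for this sketch: `topic` (the page's subject entity) is given,
    and matching is case-insensitive exact string equality.
    """
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()

    # Map each lower-cased object string to the predicates it appears with.
    objects = {}
    for subj, pred, obj in kb:
        if subj == topic:
            objects.setdefault(obj.lower(), set()).add(pred)

    labels = []
    for path, text in collector.nodes:
        for pred in sorted(objects.get(text.lower(), ())):
            labels.append((path, text, pred))
    return labels


# Hypothetical toy knowledge base and page.
KB = [
    ("Do the Right Thing", "directed_by", "Spike Lee"),
    ("Do the Right Thing", "written_by", "Spike Lee"),
    ("Do the Right Thing", "release_year", "1989"),
]
PAGE = """<html><body><h1>Do the Right Thing</h1>
<div><span>Director</span><span>Spike Lee</span></div>
<div><span>Year</span><span>1989</span></div>
<div><span>Genre</span><span>Drama</span></div></body></html>"""

labels = distant_labels(PAGE, KB, "Do the Right Thing")
for label in labels:
    print(label)
```

In this toy page the node "Spike Lee" receives both `directed_by` and `written_by`, although it sits next to the "Director" field only, and "Drama" receives no label because the knowledge base has no genre fact. These are small instances of the noisy and incomplete labels the abstract mentions, which is why a classifier is trained on the labels rather than treating them as final extractions.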
