DB IR May 4, 2015

Harvesting Entities from the Web Using Unique Identifiers -- IBEX

Aliaksandr Talaika, Joanna Biega, Antoine Amarilli, Fabian M. Suchanek

arXiv:1505.00841v1, 18 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for large-scale, accurate entity extraction from the Web. The advance is incremental: it builds on existing identifier systems but applies novel filtering techniques.

The paper tackles the problem of extracting uniquely identified entities, such as books and products, from the Web by harvesting identifiers like ISBNs and GTINs. The result is a database of millions of entities with an accuracy of 73-96% and high coverage compared to existing knowledge bases.

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
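To make the idea of using identifier properties as a noise filter concrete, here is a minimal sketch in Python. It assumes one particular property, the check digit that 13-digit GTINs and ISBN-13s carry, and it is not taken from the paper: the abstract does not say which filters IBEX applies. The regular expression and the function names `has_valid_check_digit` and `extract_identifiers` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical candidate pattern (an assumption, not the paper's extractor):
# any standalone run of exactly 13 digits.
CANDIDATE_13 = re.compile(r"(?<!\d)\d{13}(?!\d)")


def has_valid_check_digit(code: str) -> bool:
    """Return True if a 13-digit string passes the GTIN-13 / ISBN-13 check.

    The first 12 digits are weighted 1, 3, 1, 3, ... from the left; the
    13th digit must bring the weighted sum up to a multiple of 10.
    """
    if len(code) != 13 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def extract_identifiers(text: str) -> list[str]:
    """Keep only the 13-digit candidates whose check digit is consistent."""
    return [c for c in CANDIDATE_13.findall(text) if has_valid_check_digit(c)]
```

For example, `extract_identifiers("ISBN 9780306406157, order no. 1234567890123")` keeps the first number and drops the second, because only the first has a consistent check digit. A check like this removes random 13-digit strings cheaply, but it cannot tell whether a well-formed identifier is attached to the right name, so it would be only one step in a larger cleaning pipeline.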
