AI IR | Jan 14, 2017

Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History

arXiv:1701.03937v13.16 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

Hedera enables further research on Wikipedia revision history for semantic web studies, though the contribution is incremental: it builds on existing Map-Reduce methods for scalability.

The authors address the problem of efficiently extracting semantic information from the Wikipedia revision history, which is difficult to access because of its volume. Their tool, Hedera, uses Map-Reduce to process the entire revision history of Wikipedia articles within a day on a medium-scale cluster.

Much of the work in the semantic web that relies on Wikipedia as its main source of knowledge operates on static snapshots of the dataset. The full history of Wikipedia revisions, while containing much more useful information, is still difficult to access due to its exceptional volume. To enable further research on this collection, we developed a tool, named Hedera, that efficiently extracts semantic information from Wikipedia revision history datasets. Hedera exploits the Map-Reduce paradigm to achieve rapid extraction: it is able to handle the entire revision history of Wikipedia articles within a day on a medium-scale cluster, and it supports flexible data structures for various kinds of semantic web study.
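To make the Map-Reduce idea in the abstract concrete, here is a minimal in-memory sketch in Python. It is not Hedera's code or API. The record layout, the sample revisions, and the function names (`map_revision`, `shuffle`, `reduce_counts`, `run`) are all assumptions made for illustration. It treats wiki link targets as a stand-in for entities and counts how often each one appears per month across revisions.

```python
import re
from collections import defaultdict

# Hypothetical record layout: (page_title, timestamp "YYYY-MM-DD", wikitext).
# These sample revisions are invented for illustration only.
REVISIONS = [
    ("Hedera", "2016-01-03", "An ivy genus, see [[Araliaceae]]."),
    ("Hedera", "2016-01-20", "An ivy genus, see [[Araliaceae]] and [[Ivy|common ivy]]."),
    ("Ivy", "2016-02-11", "Plants in the family [[Araliaceae]]."),
]

# Captures the link target of [[Target]], [[Target|label]] or [[Target#section]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def map_revision(rev):
    """Map step: emit ((entity, month), 1) for every link in one revision."""
    _title, timestamp, text = rev
    month = timestamp[:7]  # "YYYY-MM"
    for target in LINK.findall(text):
        yield (target.strip(), month), 1


def shuffle(pairs):
    """Group emitted values by key, as a Map-Reduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one (entity, month) key."""
    return key, sum(values)


def run(revisions):
    """Run map, shuffle, and reduce sequentially in a single process."""
    mapped = (pair for rev in revisions for pair in map_revision(rev))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


counts = run(REVISIONS)
```

On a real cluster the map and reduce functions would run in parallel over partitions of the revision dump. The single-process `run` above only shows how the data flows between the steps.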
