IR, DL | Aug 6, 2014

Unstable markup: A template-based information extraction from web sites with unstable markup

arXiv:1408.1260v1 | 12 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the specific issue of information extraction for semantic publishing in academic contexts, but it is incremental as it applies existing template and linking methods to a new dataset.

The paper tackles the problem of extracting structured data from web sites with unstable markup by developing a template-based crawler for the CEUR Workshop Proceedings, producing a Linked Open Data dataset whose entities are linked to DBpedia.

This paper presents the results of crawling the CEUR Workshop Proceedings web site into a Linked Open Data (LOD) dataset as part of the ESWC 2014 Semantic Publishing Challenge. Our approach uses an extensible template-dependent crawler, with DBpedia used to link extracted entities such as the names of universities and countries.
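The template-dependent idea in the abstract can be illustrated with a minimal sketch. It is not the authors' crawler: the two regex templates, the function names `extract_title` and `dbpedia_uri`, and the naive label-to-URI rule are all assumptions made for illustration. Each template targets one markup variant, and the first one that matches wins, so the extractor can be extended by adding templates.

```python
import re
from typing import Optional

# Hypothetical templates, one per markup variant of a proceedings page.
# These patterns are illustrative assumptions, not taken from the paper.
TEMPLATES = [
    re.compile(r'<span class="CEURVOLTITLE">(?P<title>[^<]+)</span>'),
    re.compile(r'<h1>\s*(?P<title>[^<]+?)\s*</h1>'),
]


def extract_title(html: str) -> Optional[str]:
    """Return the title from the first template that matches, else None."""
    for template in TEMPLATES:
        match = template.search(html)
        if match:
            return match.group("title").strip()
    return None


def dbpedia_uri(label: str) -> str:
    """Build a candidate DBpedia resource URI from a label.

    Assumption: spaces map to underscores. A real linker would also
    verify that the resource exists and handle ambiguous names.
    """
    return "http://dbpedia.org/resource/" + label.strip().replace(" ", "_")
```

Supporting a new markup variant then means appending one more pattern to `TEMPLATES`, leaving the extraction loop unchanged.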

