IR, Aug 26, 2017

Effective Blog Pages Extractor for Better UGC Accessing

arXiv:1708.07935v1
Originality: Incremental advance
AI Analysis

This paper addresses the need for better access to user-generated content and for adapting that content to different devices. The advance is incremental: it builds on existing extraction methods, and its contribution is a template-independent approach.

The paper tackles the problem of extracting main content from blog pages by removing noisy elements like advertisements, presenting a template-independent extractor that uses DOM-Tree conversion and SVM classifiers with spatial and content features, achieving effective results on 2,250 pages from nine diverse blog sites.

Blogs are becoming an increasingly popular medium for publishing information. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves user experience but also helps adapt the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they can be expensive to maintain, because a large number of templates need to be developed and an extractor fails once its template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM-Tree, in which every element of the page, including the title and body blocks, corresponds to a subtree. Then we construct a subtree candidate set for the title block and another for the body block, and extract both spatial and content features for the elements contained in each subtree. SVM classifiers for the title and the body blocks are trained using these features. Finally, the classifiers are used to extract the main content from blog pages. We test our extractor on 2,250 blog pages crawled from nine blog sites with clearly different styles and templates. Experimental results verify the effectiveness of our extractor.
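The first steps of the pipeline described in the abstract (DOM-Tree conversion, candidate subtrees, per-subtree features) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The candidate tag list, the feature names (`text_len`, `link_density`, `depth`), and the helper `candidate_features` are all assumptions made for this example. In particular, tree depth is only a crude stand-in for the spatial features mentioned in the abstract, which are not specified there.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM-Tree node; text runs are stored as '#text' children."""

    def __init__(self, tag, parent=None, attrs=None, text=""):
        self.tag = tag
        self.parent = parent
        self.attrs = attrs or {}
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple DOM-Tree from (reasonably well-formed) HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, dict(attrs))
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(Node("#text", self.current, text=data))


def subtree_text(node):
    """All text under a node, in document order."""
    if node.tag == "#text":
        return node.text
    return "".join(subtree_text(child) for child in node.children)


def count_chars(text):
    """Number of non-whitespace characters."""
    return len("".join(text.split()))


def link_chars(node):
    """Non-whitespace characters that sit inside <a> elements."""
    if node.tag == "a":
        return count_chars(subtree_text(node))
    return sum(link_chars(child) for child in node.children)


def depth(node):
    d = 0
    while node.parent is not None:
        d += 1
        node = node.parent
    return d


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


# Assumed candidate tags; the paper's candidate rules are not given here.
CANDIDATE_TAGS = ("div", "article", "section", "td", "h1", "h2")


def candidate_features(html):
    """Return one feature dict per candidate subtree in the page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []
    for node in walk(builder.root):
        if node.tag in CANDIDATE_TAGS:
            total = count_chars(subtree_text(node))
            rows.append({
                "id": node.attrs.get("id"),
                "tag": node.tag,
                "depth": depth(node),  # stand-in for a spatial feature
                "text_len": total,     # content feature
                "link_density": link_chars(node) / total if total else 0.0,
            })
    return rows
```

In a complete system, each feature dict would be turned into a numeric vector and passed, together with hand-labelled title and body subtrees, to an SVM implementation such as scikit-learn's `SVC`. The intuition behind the two content features is that navigation bars and advertisement blocks tend to be short and link-heavy, while body blocks tend to be long with low link density.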

