Boilerplate Detection via Semantic Classification of TextBlocks
This work addresses boilerplate detection in web content analysis, a known bottleneck in information extraction, and offers an incremental improvement through a novel method.
The paper tackles the problem of detecting boilerplate in HTML by introducing SemText, a hierarchical neural network model that uses semantic representations of HTML elements. SemText achieves state-of-the-art accuracy on news webpage datasets and remains robust on out-of-domain question-answer webpages.
We present a hierarchical neural network model called SemText to detect HTML boilerplate based on a novel semantic representation of HTML tags, class names, and text blocks. We train SemText on three published datasets of news webpages and fine-tune it using a small amount of development data from CleanEval and GoogleTrends-2017. We show that SemText achieves state-of-the-art accuracy on these datasets. We then demonstrate the robustness of SemText by showing that it also detects boilerplate effectively on out-of-domain community-based question-answer webpages.
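To make the three input signals concrete, the sketch below shows one way to collect, for each text block in a page, the HTML tags and class names of its enclosing elements. It is an illustration only and not the SemText pipeline: the names `TextBlockExtractor` and `extract_text_blocks`, the output fields, and the choice to skip `script` and `style` content are all assumptions made here, and the learned semantic representation and the hierarchical network are not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is never page content (an assumption of this sketch).
SKIP_TAGS = {"script", "style"}


class TextBlockExtractor(HTMLParser):
    """Hypothetical helper: records each text block with its ancestor tags and classes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open elements as (tag, [class names])
        self.blocks = []  # one dict per non-empty text block

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or any(t in SKIP_TAGS for t, _ in self.stack):
            return
        self.blocks.append({
            "text": text,
            "tags": [t for t, _ in self.stack],
            "classes": [c for _, cs in self.stack for c in cs],
        })


def extract_text_blocks(html):
    """Return the text blocks of an HTML string with their tag and class context."""
    parser = TextBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = ('<html><body><div class="nav menu"><a href="#">Home</a></div>'
            '<div class="article-body"><p>Main story text.</p></div>'
            '<script>var x = 1;</script></body></html>')
    for block in extract_text_blocks(page):
        print(block)
```

A classifier in the spirit of the abstract would then label each such block as boilerplate or content from these three fields. How SemText encodes and combines them is described in the paper itself, not in this sketch.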