A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
This work addresses the issue of redundant search results for users by improving content extraction from web pages, but it is incremental as it builds upon the existing ContentExtractor algorithm.
The paper tackles the problem of search engines returning irrelevant results due to non-informative web page blocks such as ads and navigation links. It proposes FastContentExtractor, a fast algorithm that automatically detects main content blocks using stored templates and maintains their hierarchical order, with the same speed as the original ContentExtractor algorithm.
Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also look at non-informative blocks of Web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a Web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new Web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page.
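
To make the template idea concrete, the following is a minimal sketch and not the FastContentExtractor algorithm itself. It assumes each page has already been split into blocks given as (structural path, text) pairs in document order, and it swaps in a simple heuristic: a block path whose text repeats across most pages of a site is treated as non-informative, while a path whose text varies is stored in the template as a content block. The names learn_template, extract_content, and max_repeat_ratio, as well as the sample paths, are hypothetical.

```python
from collections import defaultdict

# Assumption: a page is a list of (path, text) blocks in document order,
# where `path` is a hypothetical structural identifier such as "body/div#main/p".


def learn_template(pages, max_repeat_ratio=0.5):
    """Return the set of block paths whose text varies across the given pages.

    A path is kept as a content block when its most frequent text appears on
    at most `max_repeat_ratio` of the pages (a simplifying heuristic).
    """
    texts_by_path = defaultdict(list)
    for page in pages:
        for path, text in page:
            texts_by_path[path].append(text)

    template = set()
    for path, texts in texts_by_path.items():
        most_common = max(texts.count(t) for t in set(texts))
        if most_common / len(pages) <= max_repeat_ratio:
            template.add(path)
    return template


def extract_content(page, template):
    """Keep only blocks whose path is in the template, in original order."""
    return [text for path, text in page if path in template]


def make_page(label):
    """Build a toy page: shared navigation and ad blocks, unique headline and story."""
    return [
        ("body/div#nav", "Home | About"),
        ("body/div#main/h1", f"Headline {label}"),
        ("body/div#main/p", f"Story {label}"),
        ("body/div#ad", "Buy now"),
    ]


# Learn the template once from a few pages of the same site, then reuse it.
site_pages = [make_page("A"), make_page("B"), make_page("C")]
template = learn_template(site_pages)
new_page = make_page("D")
content = extract_content(new_page, template)
```

Because the template is computed once per site and then only looked up, extracting from a new page is a single pass over its blocks, and filtering in document order is what keeps the output blocks in their original order.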