Topical Discovery of Web Content
This work addresses the challenge of efficiently finding relevant web content for users or systems, although it appears incremental because it builds on existing topical discovery methods.
The paper tackles the problem of automatically discovering and selecting web pages relevant to specific analytical needs. The resulting tool reduces false positives by filtering extraneous data from each page, and it refines its recommendations using criteria such as duplicate removal and lexical diversity.
This work describes the theory and implementation of a new software tool, the "Web Topical Discovery System" (WTDS), which automatically discovers and selects new web pages relevant to specific analytical needs. We show how the research context can be specified with search keywords related to the area of interest. We then consider the important problem of removing extraneous data from a web page that contains an article, in order to minimize false positives such as a keyword match that occurs only in the "latest news" box of the same page. Duplicate removal, the richness of the information contained in the article, and its lexical diversity are all taken into account in order to provide the best set of recommendations to the end user or system.
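The selection criteria named above can be illustrated with a minimal sketch. It is not the WTDS implementation: the function names, the exact-match fingerprint used for duplicate detection, the type-token ratio used as a stand-in for lexical diversity, and the 0.3 threshold are all assumptions made for illustration. The sketch also assumes that the article text has already been separated from extraneous page elements, and that keywords are single words.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_diversity(text):
    """Type-token ratio: distinct words divided by total words (0.0 if empty)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def fingerprint(text):
    """Hash of the normalized token stream, used to spot exact duplicates."""
    normalized = " ".join(tokenize(text))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def recommend(articles, keywords, min_diversity=0.3):
    """Return (url, diversity) pairs for relevant, unique, diverse articles.

    `articles` is a list of (url, main_text) pairs; `main_text` is assumed
    to be the article body with extraneous page content already removed.
    `min_diversity` is an illustrative threshold, not a value from the paper.
    """
    wanted = {k.lower() for k in keywords}
    seen = set()
    results = []
    for url, text in articles:
        # Keep only articles whose body mentions at least one keyword.
        if not wanted & set(tokenize(text)):
            continue
        # Skip articles whose normalized text was already seen.
        fp = fingerprint(text)
        if fp in seen:
            continue
        seen.add(fp)
        # Drop articles that are too repetitive to be informative.
        score = lexical_diversity(text)
        if score >= min_diversity:
            results.append((url, score))
    # Most lexically diverse articles first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = [
        ("a", "Solar panels convert sunlight into electricity efficiently"),
        ("b", "solar panels convert sunlight into electricity efficiently!"),
        ("c", "Football results from the weekend"),
        ("d", "solar solar solar solar solar"),
    ]
    print(recommend(sample, ["solar"]))
```

In this sample, article "b" is dropped as a duplicate of "a", "c" contains no keyword, and "d" falls below the diversity threshold, so only "a" is recommended. A real system would likely need near-duplicate detection and a richer measure of information content than this sketch provides.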