Automatic Detection of Webpages that Share the Same Web Template
This work addresses the need for efficient template extraction in web development and indexing, though the contribution appears incremental, since it builds on existing template extraction goals.
The paper tackles the problem of identifying webpages that share the same template without extensive prior analysis, introducing a hyperlink-based technique that computes a small, high-confidence set of such webpages.
Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages with the same template without having to load and analyze too many webpages beforehand. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. The set is obtained through a hyperlink analysis that yields a very small set of webpages with a high level of confidence.
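The abstract does not spell out the hyperlink analysis, so the following is only an illustrative sketch, not the paper's algorithm. It assumes that pages sharing a template also share a navigation menu and therefore tend to link to one another, and it greedily collects a small set of mutually linked pages from the same host. The function names, the `links` dictionary, and the `max_pages` limit are all hypothetical.

```python
from urllib.parse import urlparse


def same_site(url_a: str, url_b: str) -> bool:
    """True when both URLs share a host (a stand-in for 'same website')."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc


def candidate_template_pages(start, links, max_pages=3):
    """Greedily collect pages mutually linked with `start` and with each other.

    Assumption (not from the paper): pages that share a menu link to one
    another, so a mutually linked group is likely to share a template.
    `links` maps a URL to the list of URLs it links to; in practice it would
    be built by parsing the <a href> elements of each loaded page.
    """
    chosen = [start]
    for url in links.get(start, []):
        if len(chosen) >= max_pages:
            break
        if url in chosen or not same_site(start, url):
            continue
        outgoing = set(links.get(url, []))
        # Keep `url` only if every chosen page links to it and it links back.
        if all(p in outgoing and url in links.get(p, []) for p in chosen):
            chosen.append(url)
    return chosen
```

Capping the result at `max_pages` reflects the stated goal of loading and analyzing as few webpages as possible; the cap value itself is an arbitrary choice for this sketch.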