IR DL

May 30, 2013

A Focused Crawler Combinatory Link and Content Model Based on T-Graph Principles

arXiv:1305.7265v1

13 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for more effective and reliable topic-specific web crawling. The advance is incremental: it builds on existing link-based and content-based methods by integrating them in a new way.

The paper tackles the problem of building a focused web crawler by proposing a novel method that combines link-based and content-based approaches to predict the topical focus of unvisited pages with high accuracy, and uses a T-Graph scoring function to prioritize URLs for efficient downloading.

The two significant tasks of a focused Web crawler are finding relevant topic-specific documents on the Web and analytically prioritizing them for later effective and reliable download. For the first task, we propose a sophisticated custom algorithm to fetch and analyze the most effective HTML structural elements of the page as well as the topical boundary and anchor text of each unvisited link, based on which the topical focus of an unvisited page can be predicted and elicited with high accuracy. Thus, our novel method uniquely combines both link-based and content-based approaches. For the second task, we propose a scoring function of the relevant URLs through the use of T-Graph (Treasure Graph) to assist in prioritizing the unvisited links that will later be put into the fetching queue. Our Web search system is called the Treasure-Crawler. This research paper embodies the architectural design of the Treasure-Crawler system which satisfies the principal requirements of a focused Web crawler, and asserts the correctness of the system structure including all its modules through illustrations and by the test results.
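The two tasks described in the abstract, predicting the topic of an unvisited page from its link context and ordering the fetch queue by score, can be illustrated with a minimal sketch. This is not the Treasure-Crawler algorithm or the T-Graph scoring function, neither of which is specified here. It swaps in a plain bag-of-words cosine similarity, and the names `score_link` and `Frontier`, the weights, and the example URLs are all hypothetical choices made for illustration.

```python
import heapq
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def score_link(topic, anchor_text, context_text, w_anchor=0.6, w_context=0.4):
    """Weighted mix of anchor-text and surrounding-text similarity to the topic.

    Assumption: the weights are illustrative, not values from the paper, and
    cosine similarity stands in for the paper's T-Graph scoring function.
    """
    topic_vec = Counter(tokenize(topic))
    anchor_sim = cosine(topic_vec, Counter(tokenize(anchor_text)))
    context_sim = cosine(topic_vec, Counter(tokenize(context_text)))
    return w_anchor * anchor_sim + w_context * context_sim


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        """Queue a URL once; later pushes of the same URL are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return (url, score) for the best-scoring queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Hypothetical links: (URL, anchor text, text surrounding the link).
topic = "focused web crawler"
links = [
    ("http://example.com/b", "cooking recipes", "pasta and sauce ideas"),
    ("http://example.com/a", "focused crawler tutorial", "a guide to building a web crawler"),
]

frontier = Frontier()
for url, anchor, context in links:
    frontier.push(url, score_link(topic, anchor, context))

while frontier:
    print(frontier.pop())
```

Under these assumptions the on-topic link is fetched before the off-topic one even though it was queued second. A real focused crawler would replace `score_link` with its own relevance model and refresh scores as new pages are downloaded.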
