IR · Nov 17, 2014

PDD Crawler: A focused web crawler using link and content analysis for relevance prediction

Prashant Dahiwale, M M Raghuwanshi, Latesh Malik

arXiv:1411.4366v1 · 1 citation

AI Analysis

The paper addresses web search relevance for everyday users such as computer and mobile phone enthusiasts, but the contribution appears incremental because it builds on existing crawling strategies.

The paper tackles the problem of efficiently searching the rapidly growing web for specific information. It proposes the PDD Crawler, which combines link and content analysis to predict page relevance by computing a weight for each page and treating the highest-weight pages as the most relevant.

The majority of computer and mobile phone enthusiasts use the web for searching. Web search engines carry out these searches, and the results they return are supplied to them by a software module known as the web crawler. The web grows around the clock, so the principal problem is to search this huge database for specific information. Deciding whether a web page is relevant to a search topic is difficult. This paper proposes a crawler, called the PDD crawler, that follows both a link-based and a content-based approach. The crawler follows a completely new crawling strategy to compute the relevance of a page. It analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight is thus taken to have the most content and the highest relevance.
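
The tag-based weighting described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which tags are used or how they are weighted, so the `TAG_WEIGHTS` table, the `TagWeightParser` class, and the `page_weight` function are all hypothetical names and values chosen for the example. It covers only the content-based side (with anchor text standing in for link information) and uses Python's standard `html.parser` module.

```python
import re
from html.parser import HTMLParser

# Hypothetical per-tag weights; the abstract does not give the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "a": 2.0, "p": 1.0}


class TagWeightParser(HTMLParser):
    """Accumulates a page weight from topic-term hits inside weighted tags."""

    def __init__(self, topic_terms):
        super().__init__()
        self.topic_terms = {t.lower() for t in topic_terms}
        self.open_tags = []  # stack of currently open weighted tags
        self.weight = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the innermost matching tag and anything left unclosed inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        if not self.open_tags:
            return  # text outside any weighted tag contributes nothing
        tag_weight = TAG_WEIGHTS[self.open_tags[-1]]
        words = re.findall(r"[a-z0-9]+", data.lower())
        hits = sum(1 for word in words if word in self.topic_terms)
        self.weight += tag_weight * hits


def page_weight(html, topic_terms):
    """Return the total weight of one page for the given topic terms."""
    parser = TagWeightParser(topic_terms)
    parser.feed(html)
    parser.close()
    return parser.weight


def rank_pages(pages, topic_terms):
    """Sort (url, html) pairs from highest to lowest page weight."""
    scored = [(page_weight(html, topic_terms), url) for url, html in pages]
    return [url for _, url in sorted(scored, reverse=True)]
```

Under these assumptions, a crawler would call `rank_pages` on the pages it has fetched and visit or report the highest-weight ones first. A term in the title counts for more than the same term in body text only because of the illustrative weights chosen above.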
