IR | Apr 8, 2018

A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

arXiv:1804.02734v1 | 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses the need for automated, efficient crawling of social media sites without human supervision, but it appears incremental: it builds on existing crawling concepts and adds page structure as the learning signal.

The paper tackles the problem of efficiently crawling social media sites by introducing SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn navigation patterns. Experiments show it better focuses on user-created content than baseline methods, though no concrete performance numbers are provided.

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe works in two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to choose which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently and stays more focused on user-created content than baseline methods.
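
The two stages described above can be illustrated with a minimal sketch. It is not the paper's implementation: the structural signature (the set of root-to-element tag paths), the Jaccard similarity, the 0.6 threshold, the greedy clustering, and every function name here are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TagPaths(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural signature of a page (an assumed stand-in for 'page structure')."""
    parser = _TagPaths()
    parser.feed(html)
    return frozenset(parser.paths)


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def build_sitemap(pages, threshold=0.6):
    """Learning phase, part 1: greedily cluster pages by structural similarity."""
    representatives, assignment = [], {}
    for url, html in pages.items():
        signature = tag_paths(html)
        for cluster_id, rep in enumerate(representatives):
            if jaccard(signature, rep) >= threshold:
                assignment[url] = cluster_id
                break
        else:
            representatives.append(signature)
            assignment[url] = len(representatives) - 1
    return assignment


def build_navigation_table(links, assignment):
    """Learning phase, part 2: count links between page clusters."""
    table = defaultdict(lambda: defaultdict(int))
    for src, dst in links:
        if src in assignment and dst in assignment:
            table[assignment[src]][assignment[dst]] += 1
    return table


def clusters_to_follow(table, current, targets):
    """Harvesting phase: a toy policy that follows links only into target clusters."""
    return sorted(c for c in table[current] if c in targets)


# Hypothetical three-page site: one index page and two thread pages.
pages = {
    "/": "<html><body><ul><li><a href='/t/1'>a</a></li></ul></body></html>",
    "/t/1": "<html><body><div><p>post</p><p>reply</p></div></body></html>",
    "/t/2": "<html><body><div><p>other</p></div></body></html>",
}
links = [("/", "/t/1"), ("/", "/t/2"), ("/t/1", "/")]

sitemap = build_sitemap(pages)
nav = build_navigation_table(links, sitemap)
```

In this toy site the two thread pages share a signature and land in one cluster, the index page lands in another, and a policy that targets the thread cluster would follow index-to-thread links while skipping the link back to the index.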
