IR · May 25, 2013

ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph

arXiv:1305.5959v2 · Has Code
AI Analysis

This work addresses the need for richer APIs in web archiving to support applications. Its contribution is incremental, since it builds on existing Wayback Machine installations.

The authors tackle the problem of limited access to content and structural metadata in web archives by developing ArcLink, a proof-of-concept system that optimizes the construction, storage, and retrieval of the temporal web graph. The resulting API supports applications such as retrieving inlinks, outlinks, anchortext, and PageRank.

Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction and storage of, and access to, the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimizations for each stage. ArcLink extends the current web archive interfaces to return content and structural metadata for each URI. We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.
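To make the idea of a temporal web graph concrete, the following is a minimal in-memory sketch in Python. It is not ArcLink's implementation or API: the class `TemporalWebGraph`, its method names, and the rule that a page's links at time `at` come from its latest capture at or before `at` are all assumptions made for illustration. Timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss), which sort correctly as plain strings.

```python
from collections import defaultdict


class TemporalWebGraph:
    """Hypothetical sketch: per-URI captures, each with its extracted links."""

    def __init__(self):
        # uri -> {timestamp: [(target_uri, anchor_text), ...]}
        self._captures = defaultdict(dict)
        # target_uri -> set of source URIs that linked to it in any capture
        self._sources = defaultdict(set)

    def add_capture(self, uri, timestamp, links):
        """Store the links extracted from one capture of `uri`."""
        self._captures[uri][timestamp] = list(links)
        for target, _anchor in links:
            self._sources[target].add(uri)

    def _latest(self, uri, at):
        """Timestamp of the latest capture of `uri` at or before `at`."""
        stamps = [t for t in self._captures.get(uri, {}) if t <= at]
        return max(stamps) if stamps else None

    def outlinks(self, uri, at):
        """(target, anchor) pairs from the capture of `uri` in effect at `at`."""
        ts = self._latest(uri, at)
        return [] if ts is None else list(self._captures[uri][ts])

    def inlinks(self, uri, at):
        """(source, anchor) pairs for pages that link to `uri` at time `at`."""
        found = []
        for source in sorted(self._sources.get(uri, ())):
            for target, anchor in self.outlinks(source, at):
                if target == uri:
                    found.append((source, anchor))
        return found

    def anchor_text(self, uri, at):
        """Anchor texts used by pages linking to `uri` at time `at`."""
        return [anchor for _source, anchor in self.inlinks(uri, at)]

    def pagerank(self, at, damping=0.85, iterations=50):
        """Plain power-iteration PageRank over the graph snapshot at `at`."""
        edges, nodes = {}, set()
        for uri in self._captures:
            if self._latest(uri, at) is None:
                continue  # page not yet captured at this time
            targets = {t for t, _a in self.outlinks(uri, at)}
            edges[uri] = targets
            nodes.add(uri)
            nodes |= targets
        if not nodes:
            return {}
        n = len(nodes)
        rank = {u: 1.0 / n for u in nodes}
        for _ in range(iterations):
            # Rank held by pages without outlinks is spread evenly.
            dangling = sum(rank[u] for u in nodes if not edges.get(u))
            new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
            for u, targets in edges.items():
                for t in targets:
                    new[t] += damping * rank[u] / len(targets)
            rank = new
        return rank
```

A query then takes a URI and a point in time, for example `graph.inlinks("http://b.example/", "20110601000000")`. The same URI can return different inlinks at different times because each source page is resolved to whichever capture was current at the requested time. A real system at archive scale would need persistent, indexed storage rather than Python dictionaries.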

