Liudmila Ostroumova

IR
3 papers
30 citations
Novelty 38%
AI Score 20

3 Papers

IR, Jul 23, 2013
Timely crawling of high-quality ephemeral new content

Damien Lefortier, Liudmila Ostroumova, Egor Samosvat et al.

More and more people use the Web as their primary source of up-to-date information. In this context, fast crawling and indexing of newly created Web pages has become crucial for search engines, especially because user traffic to a significant fraction of these new pages (such as news, blog, and forum posts) grows quickly right after they appear but lasts only for several days. In this paper, we study the problem of timely finding and crawling of such ephemeral new pages (ephemeral in terms of user interest). Traditional crawling policies do not give any particular priority to such pages and may thus crawl them too slowly, or even crawl content that is already obsolete. We therefore propose a new metric, designed for this task, which takes into account the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and present a procedure for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide resources between these two activities efficiently. We find the adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler insight into current user demand. Finally, we demonstrate the efficiency of our approach experimentally on real-world data.

IR, Sep 20, 2012
Evolution of the Media Web

Damien Lefortier, Liudmila Ostroumova, Egor Samosvat

We present a detailed study of the part of the Web related to media content, i.e., the Media Web. Using publicly available data, we analyze the evolution of incoming and outgoing links to and from media pages. Based on our observations, we propose a new class of models for the appearance of new media content on the Web, in which different attractiveness functions of nodes are possible, including ones taken from well-known preferential attachment and fitness models. We analyze these models theoretically and empirically and show which ones realistically predict both the incoming degree distribution and the so-called recency property of the Media Web, something that existing models did not do well. Finally, we compare these models by estimating the likelihood of the real-world link graph from our data set given each model, and find that the models we introduce are significantly more likely than previously proposed ones. One of the most surprising results is that in the Media Web the probability for a post to be cited is determined, most likely, by its quality rather than by its current popularity.
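The idea of interchangeable attractiveness functions can be illustrated with a minimal Python sketch. It is not the authors' model: the function names, the page dictionary fields, the `+ 1` smoothing on in-degree, and the exponential age decay used to mimic recency are all assumptions made for illustration, since the abstract does not specify the functional forms.

```python
import math
import random


def attractiveness(quality, indegree, age,
                   use_quality=True, use_degree=False, decay=0.1):
    """Hypothetical attractiveness of a page.

    use_quality -> fitness-style term (intrinsic quality)
    use_degree  -> preferential-attachment-style term (current popularity)
    The exponential age decay is an assumed stand-in for recency.
    """
    score = 1.0
    if use_quality:
        score *= quality
    if use_degree:
        score *= indegree + 1  # +1 so pages with no links can still be cited
    return score * math.exp(-decay * age)


def pick_cited_page(pages, now, rng, **kwargs):
    """Choose the index of a page to cite, proportionally to attractiveness."""
    weights = [
        attractiveness(p["quality"], p["indegree"], now - p["created"], **kwargs)
        for p in pages
    ]
    return rng.choices(range(len(pages)), weights=weights, k=1)[0]


# Toy example with made-up pages.
pages = [
    {"quality": 2.0, "indegree": 0, "created": 9},   # new, high quality
    {"quality": 0.5, "indegree": 40, "created": 0},  # old, popular
]
rng = random.Random(0)
cited = pick_cited_page(pages, now=10, rng=rng)
```

Switching `use_quality` and `use_degree` changes which family of model the sampler imitates, which is the kind of comparison the abstract describes at a much larger scale.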

SI, Aug 11, 2012
Empirical Validation of the Buckley–Osthus Model for the Web Host Graph: Degree and Edge Distributions

Maxim Zhukovskiy, Dmitry Vinogradov, Yuri Pritykin et al.

There has been a lot of research on random graph models for large real-world networks such as those formed by hyperlinks between web pages in the World Wide Web. Though largely successful in capturing their key properties qualitatively, such models may lack important quantitative characteristics of Internet graphs. While preferential attachment random graph models were shown to be capable of reflecting the degree distribution of the web graph, their ability to reflect certain aspects of the edge distribution has not yet been well studied. In this paper, we consider the Buckley–Osthus implementation of preferential attachment and its ability to model the web host graph in two respects. One is the degree distribution, which we observe to follow a power law, as is often the case for real-world graphs. The other is the two-dimensional edge distribution, i.e., the number of edges between vertices of given degrees. We fit a single "initial attractiveness" parameter $a$ of the model, first with respect to the degree distribution of the web host graph, and then, independently, with respect to the edge distribution. Surprisingly, the values of $a$ we obtain turn out to be nearly the same. Therefore, the same model with the same value of the parameter $a$ fits both of these independent and basic aspects of the web host graph very well. In addition, we demonstrate that other models completely fail to reproduce the asymptotic behavior of the edge distribution of the web host graph, even when they accurately capture the degree distribution. To the best of our knowledge, this is the first attempt to describe, for a real Internet graph, the distribution of edges between vertices with respect to their degrees.
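As a rough illustration of preferential attachment with an initial attractiveness parameter $a$, the following Python sketch grows a small graph in which each new vertex sends one edge to an existing vertex chosen with probability proportional to its in-degree plus $a$. This is a simplified variant written for illustration, not the exact Buckley–Osthus construction used in the paper: it assumes one out-edge per vertex and no self-loops, and the function name and parameters are hypothetical.

```python
import random


def simulate_preferential_attachment(n_vertices, a, seed=0):
    """Return the list of in-degrees after growing a graph to n_vertices.

    Simplifying assumptions (not from the paper): one out-edge per new
    vertex, no self-loops, and a > 0 so every weight is positive.
    """
    rng = random.Random(seed)
    indegree = [0]  # vertex 0 starts alone with no incoming edges
    for new_vertex in range(1, n_vertices):
        # Attachment weight of each existing vertex: in-degree + a.
        weights = [d + a for d in indegree]
        target = rng.choices(range(new_vertex), weights=weights, k=1)[0]
        indegree[target] += 1
        indegree.append(0)  # the new vertex has no incoming edges yet
    return indegree


degrees = simulate_preferential_attachment(500, a=0.5, seed=1)
```

Smaller values of `a` make already-cited vertices dominate the choice, while larger values push the process toward uniform attachment. The loop recomputes all weights at every step, so it is only suitable for small toy graphs.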