Sebastiano Vigna

DS
9papers
655citations
Novelty43%
AI Score26

9 Papers

IRJan 26, 2016Code
BUbiNG: Massive Crawling for the Masses

Paolo Boldi, Andrea Marino, Massimo Santini et al.

Although web crawlers have been around for twenty years by now, there is virtually no freely available, opensource crawling software that guarantees high throughput, overcomes the limits of single-machine systems and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap, through the description of BUbiNG, our next-generation web crawler built upon the authors' experience with UbiCrawler [Boldi et al. 2004] and on the last ten years of research on the topic. BUbiNG is an opensource Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousands pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so to achieve very high throughput.

SIFeb 2, 2022
Spectral Rank Monotonicity on Undirected Networks

Paolo Boldi, Flavio Furia, Sebastiano Vigna

We study the problem of score and rank monotonicity for spectral ranking methods, such as eigenvector centrality and PageRank, in the case of undirected networks. Score monotonicity means that adding an edge increases the score at both ends of the edge. Rank monotonicity means that adding an edge improves the relative position of both ends of the edge with respect to the remaining nodes. It is known that common spectral rankings are both score and rank monotone on directed, strongly connected graphs. We show that, surprisingly, the situation is very different for undirected graphs, and in particular that PageRank is neither score nor rank monotone.

DSOct 14, 2019
It is high time we let go of the Mersenne Twister

Sebastiano Vigna

When the Mersenne Twister made his first appearance in 1997 it was a powerful example of how linear maps on $\mathbf F_2$ could be used to generate pseudorandom numbers. In particular, the easiness with which generators with long periods could be defined gave the Mersenne Twister a large following, in spite of the fact that such long periods are not a measure of quality, and they require a large amount of memory. Even at the time of its publication, several defects of the Mersenne Twister were predictable, but they were somewhat obscured by other interesting properties. Today the Mersenne Twister is the default generator in C compilers, the Python language, the Maple mathematical computation system, and in many other environments. Nonetheless, knowledge accumulated in the last $20$ years suggests that the Mersenne Twister has, in fact, severe defects, and should never be used as a general-purpose pseudorandom number generator. Many of these results are folklore, or are scattered through very specialized literature. This paper surveys these results for the non-specialist, providing new, simple, understandable examples, and it is intended as a guide for the final user, or for language implementors, so that they can take an informed decision about whether to use the Mersenne Twister or not.

DSMay 3, 2018
Scrambled Linear Pseudorandom Number Generators

David Blackman, Sebastiano Vigna

$\mathbf F_2$-linear pseudorandom number generators are very popular due to their high speed, to the ease with which generators with a sizable state space can be created, and to their provable theoretical properties. However, they suffer from linear artifacts that show as failures in linearity-related statistical tests such as the binary-rank and the linear-complexity test. In this paper, we give two new contributions. First, we introduce two new $\mathbf F_2$-linear transformations that have been handcrafted to have good statistical properties and at the same time to be programmable very efficiently on superscalar processors, or even directly in hardware. Then, we describe some scramblers, that is, nonlinear functions applied to the state array that reduce or delete the linear artifacts, and propose combinations of linear transformations and scramblers that give extremely fast pseudorandom number generators of high quality. A novelty in our approach is that we use ideas from the theory of filtered linear-feedback shift registers to prove some properties of our scramblers, rather than relying purely on heuristics. In the end, we provide simple, extremely fast generators that use a few hundred bits of memory, have provable properties, and pass strong statistical tests.

SIMar 30, 2015
Liquid FM: Recommending Music through Viscous Democracy

Paolo Boldi, Corrado Monti, Massimo Santini et al.

Most modern recommendation systems use the approach of collaborative filtering: users that are believed to behave alike are used to produce recommendations. In this work we describe an application (Liquid FM) taking a completely different approach. Liquid FM is a music recommendation system that makes the user responsible for the recommended items. Suggestions are the result of a voting scheme, employing the idea of viscous democracy. Liquid FM can also be thought of as the first testbed for this voting system. In this paper we outline the design and architecture of the application, both from the theoretical and from the implementation viewpoints.

SIApr 12, 2014
A Weighted Correlation Index for Rankings with Ties

Sebastiano Vigna

Understanding the correlation between two different scores for the same set of items is a common problem in information retrieval, and the most commonly used statistics that quantifies this correlation is Kendall's $τ$. However, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular while comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in web and social graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an $O(n\log n)$ algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data.

DSApr 1, 2014
Further scramblings of Marsaglia's xorshift generators

Sebastiano Vigna

xorshift* generators are a variant of Marsaglia's xorshift generators that eliminate linear artifacts typical of generators based on $\mathbf Z/2\mathbf Z$-linear operations using multiplication by a suitable constant. Shortly after high-dimensional xorshift* generators were introduced, Saito and Matsumoto suggested a different way to eliminate linear artifacts based on addition in $\mathbf Z/2^{32}\mathbf Z$, leading to the XSadd generator. Starting from the observation that the lower bits of XSadd are very weak, as its reverse fails systematically several statistical tests, we explore xorshift+, a variant of XSadd using 64-bit operations, which leads, in small dimension, to extremely fast high-quality generators.

DSJan 22, 2014
An experimental exploration of Marsaglia's xorshift generators, scrambled

Sebastiano Vigna

Marsaglia proposed recently xorshift generators as a class of very fast, good-quality pseudorandom number generators. Subsequent analysis by Panneton and L'Ecuyer has lowered the expectations raised by Marsaglia's paper, showing several weaknesses of such generators, verified experimentally using the TestU01 suite. Nonetheless, many of the weaknesses of xorshift generators fade away if their result is scrambled by a non-linear operation (as originally suggested by Marsaglia). In this paper we explore the space of possible generators obtained by multiplying the result of a xorshift generator by a suitable constant. We sample generators at 100 equispaced points of their state space and obtain detailed statistics that lead us to choices of parameters that improve on the current ones. We then explore for the first time the space of high-dimensional xorshift generators, following another suggestion in Marsaglia's paper, finding choices of parameters providing periods of length $2^{1024} - 1$ and $2^{4096} - 1$. The resulting generators are of extremely high quality, faster than current similar alternatives, and generate long-period sequences passing strong statistical tests using only eight logical operations, one addition and one multiplication by a constant.

IRJun 19, 2012
Quasi-Succinct Indices

Sebastiano Vigna

Compressed inverted indices in use today are based on the idea of gap compression: documents pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.