7.2 PR Apr 9
Planted clique recovery in random geometric graphs
Konstantin Avrachenkov, Andrei Bobu, Nelly Litvak et al.
We investigate the problem of identifying planted cliques in random geometric graphs, focusing on two distinct algorithmic approaches: the first based on vertex degrees (VD) and the other on common neighbors (CN). We analyze the performance of these methods under varying regimes of key parameters, namely the average degree of the graph and the size of the planted clique. We demonstrate that, for a specific set of parameter values, exact recovery is achieved with high probability as the graph size increases. Notably, our results reveal that the CN-algorithm significantly outperforms the VD-algorithm. In particular, in the connectivity regime, tiny planted cliques (even single edges) are correctly identified by the CN-algorithm, which has significant implications for anomaly detection. Finally, our results are confirmed by a series of numerical experiments, showing that the devised algorithms are effective in practice.
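The intuition behind a common-neighbor test can be sketched as follows. This is only an illustration under stated assumptions, not the CN-algorithm analyzed in the paper: the grid layout, the radius, the threshold, and the function names `geometric_adjacency` and `low_common_neighbor_edges` are all hypothetical. In a geometric graph, adjacent vertices are close in space and therefore tend to share neighbors, whereas a planted edge between distant vertices shares few or none.

```python
import numpy as np

def geometric_adjacency(points, radius):
    """Connect every pair of points whose Euclidean distance is at most `radius`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = (dist <= radius).astype(int)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj

def low_common_neighbor_edges(adj, threshold):
    """Return edges (i, j), i < j, whose endpoints share fewer than `threshold` neighbors."""
    common = adj @ adj  # common[i, j] = number of common neighbors of i and j
    n = adj.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if adj[i, j] and common[i, j] < threshold]

# Toy instance (assumption): a 6 x 6 grid instead of random points, so the
# outcome is deterministic. Radius 1.5 links each point to its 8 grid neighbors.
side = 6
points = np.array([(x, y) for x in range(side) for y in range(side)], dtype=float)
adj = geometric_adjacency(points, radius=1.5)

# Plant a single edge between two opposite corners (vertices 0 and 35).
adj[0, 35] = adj[35, 0] = 1

suspicious = low_common_neighbor_edges(adj, threshold=1)
print(suspicious)
```

On this toy grid every geometric edge has at least two common neighbors, so only the planted corner-to-corner edge falls below the threshold.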
14.5 PR May 11
The stochastic block model has the overlap gap property for modularity
Shankar Bhamidi, David Gamarnik, Remco van der Hofstad et al.
The overlap gap property (OGP) is a statement about the geometry of near-optimal solutions. Exhibiting OGP implies the failure of a class of local algorithms, and OGP has been observed to coincide with conjectured algorithmic limits in problems with a statistical-computational gap. We consider the Stochastic Block Model (SBM), where the graph has a planted partition with $k$ equal-size blocks which form the `communities', and where, for parameters $p>q$, vertices within the same community connect with probability $p$, while vertices in different communities connect with probability $q$, independently across pairs of vertices. Modularity-based clustering algorithms have become ubiquitous in applications. This article studies the theoretical limits of local algorithms based on the modularity score on the SBM. We establish that modularity exhibits OGP on the SBM. This rules out a class of local algorithms based on modularity for recovery in the SBM, and shows slow mixing time for a related Markov chain. Theoretically, this is one of the few instances where OGP has been established for a `planted' model, as most such analyses to date consider the `null' model. As part of our analysis, we extend a result by Bickel and Chen (2009), who established that, with high probability, the modularity-optimal partition of the SBM is $o(n)$ local moves away from the planted partition, where $n$ is the graph size. We show that, with high probability, any partition with modularity score sufficiently near the optimal value is close to the planted partition.
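For readers unfamiliar with the modularity score, a minimal sketch of the standard Newman-Girvan definition is given here. The exact variant studied in the paper is not stated in the abstract and may differ, so the formula, the function name `modularity`, and the toy graph are assumptions for illustration.

```python
import numpy as np

def modularity(adj, labels):
    """Newman-Girvan modularity of a partition of an undirected graph.

    Q = (1 / 2m) * sum_{i,j} (A_ij - d_i * d_j / 2m) * [labels_i == labels_j]
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_block = labels[:, None] == labels[None, :]
    null_model = np.outer(degrees, degrees) / two_m
    return float(((adj - null_model) * same_block).sum() / two_m)

# Toy graph (assumption): two triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

q_planted = modularity(adj, [0, 0, 0, 1, 1, 1])  # the two triangles as blocks
q_trivial = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one block
print(q_planted, q_trivial)
```

Splitting the toy graph into its two triangles gives a strictly positive score, while the single-block partition scores zero, which is why maximizing this quantity favors the planted structure here.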
70.0 OC May 1
Linking PageRank, Time Reversal, and Policy Evaluation
Konstantin Avrachenkov, Lorenzo Gregoris, Nelly Litvak
We establish a connection between policy evaluation in Markov decision processes and PageRank in network analysis. For a fixed policy, we show that the value function of a discounted Markov decision process can be obtained, up to an explicit rescaling, from the PageRank vector of a suitably defined time-reversed Markov chain. In this correspondence, the discount factor plays the role of the teleportation parameter, while rewards induce the restart distribution. Beyond the irreducible case, invoking quasi-stationary distributions and Doob $h$-transforms, we prove a general decomposition theorem showing that policy evaluation for arbitrary finite MDPs reduces to a collection of PageRank problems on the recurrent and transient components of the policy-induced Markov chain. This framework naturally extends to undiscounted MDPs with terminal states and to transition-dependent rewards. We conclude by demonstrating the efficiency of our approach on a numerical example of a sticky random walk on large deterministic and random graphs.
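One way such a correspondence can be checked numerically in the irreducible case is sketched below. The construction is an assumption derived from the abstract's description, not the paper's exact statement: it takes strictly positive rewards, uses the stationary distribution `mu` of the policy-induced chain `P` to build the time-reversed chain `Q[i, j] = mu[j] * P[j, i] / mu[i]`, and sets the restart distribution proportional to `mu * r`. All function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve mu P = mu with sum(mu) = 1 for an irreducible chain (least squares)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def value_direct(P, r, gamma):
    """Standard policy evaluation: v = (I - gamma P)^{-1} r."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def value_via_pagerank(P, r, gamma):
    """Recover v by rescaling the PageRank vector of the time-reversed chain."""
    n = P.shape[0]
    mu = stationary_distribution(P)
    Q = (P.T * mu) / mu[:, None]        # time reversal: Q[i, j] = mu[j] P[j, i] / mu[i]
    restart = mu * r / (mu @ r)         # rewards induce the restart distribution
    # PageRank row vector: pi = (1 - gamma) * restart @ (I - gamma Q)^{-1}
    pi = (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * Q).T, restart)
    return pi * (mu @ r) / ((1 - gamma) * mu)  # explicit rescaling back to v

rng = np.random.default_rng(0)
P = rng.random((5, 5)) + 0.1
P /= P.sum(axis=1, keepdims=True)       # dense positive rows, hence irreducible
r = rng.random(5) + 0.1                 # strictly positive rewards (assumption)
gamma = 0.9

v_direct = value_direct(P, r, gamma)
v_pagerank = value_via_pagerank(P, r, gamma)
print(np.max(np.abs(v_direct - v_pagerank)))
```

The two computations agree because, with `D = diag(mu)`, the reversed chain satisfies `Q.T = D P D^{-1}`, so the PageRank vector equals `(1 - gamma) * D v / (mu @ r)`. Here the discount factor `gamma` is used directly as the probability of following a link in PageRank.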
LG Nov 9, 2021
Look back, look around: a systematic analysis of effective predictors for new outlinks in focused Web crawling
Thi Kim Nhung Dang, Doina Bucur, Berk Atil et al.
Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself, and of its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
SI Jul 6, 2021
The Hyperspherical Geometry of Community Detection: Modularity as a Distance
Martijn Gösgens, Remco van der Hofstad, Nelly Litvak
We introduce a metric space of clusterings, where clusterings are described by a binary vector indexed by the vertex-pairs. We extend this geometry to a hypersphere and prove that maximizing modularity is equivalent to minimizing the angular distance to some modularity vector over the set of clustering vectors. In that sense, modularity-based community detection methods can be seen as a subclass of a more general class of projection methods, which we define as the community detection methods that adhere to the following two-step procedure: first, mapping the network to a point on the hypersphere; second, projecting this point to the set of clustering vectors. We show that this class of projection methods contains many interesting community detection methods. Many of these new methods cannot be described in terms of null models and resolution parameters, as is customary for modularity-based methods. We provide a new characterization of such methods in terms of meridians and latitudes of the hypersphere. In addition, by relating the modularity resolution parameter to the latitude of the corresponding modularity vector, we obtain a new interpretation of the resolution limit that modularity maximization is known to suffer from.
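The idea of measuring an angle between clusterings can be sketched as follows. The abstract only says that a clustering is described by a binary vector indexed by vertex pairs; the specific encoding used here (+1 for a pair in the same cluster, -1 otherwise) and the function names are assumptions for illustration.

```python
from itertools import combinations

import numpy as np

def clustering_vector(labels):
    """Encode a clustering as a +1/-1 vector indexed by unordered vertex pairs."""
    return np.array([1.0 if labels[i] == labels[j] else -1.0
                     for i, j in combinations(range(len(labels)), 2)])

def angular_distance(x, y):
    """Angle in radians between two nonzero vectors."""
    cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))  # clip guards rounding

a = clustering_vector([0, 0, 1])  # pairs (0,1), (0,2), (1,2) -> [+1, -1, -1]
b = clustering_vector([0, 1, 1])  # pairs (0,1), (0,2), (1,2) -> [-1, -1, +1]

angle_ab = angular_distance(a, b)
angle_aa = angular_distance(a, a)
print(angle_ab, angle_aa)
```

With this encoding all clustering vectors have the same norm, so the angle depends only on how many vertex pairs the two clusterings disagree on. The same `angular_distance` function could be applied between a clustering vector and any other point on the hypersphere, which is the kind of comparison the projection view relies on.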
SI Jun 20, 2018
Mean Field Analysis of Personalized PageRank with Implications for Local Graph Clustering
Konstantin Avrachenkov, Arun Kadavankandy, Nelly Litvak
We analyse a mean-field model of Personalized PageRank on the Erdős-Rényi random graph containing a denser planted Erdős-Rényi subgraph. We investigate the regimes where the values of Personalized PageRank concentrate around the mean-field value. We also study the optimization of the damping factor, the only parameter in Personalized PageRank. Our theoretical results help to understand the applicability of Personalized PageRank and its limitations for local graph clustering.
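A minimal sketch of this setting follows: Personalized PageRank computed by power iteration on an Erdős-Rényi graph with a denser planted block. The graph sizes, edge probabilities, seed set, the treatment of isolated vertices, and the function names are all assumptions chosen for illustration, and `alpha` denotes the damping factor in the convention where it is the probability of following an edge.

```python
import numpy as np

def planted_er_graph(n, k, p_in, p_out, rng):
    """Erdős-Rényi graph on n vertices; the first k vertices form a denser block."""
    probs = np.full((n, n), p_out)
    probs[:k, :k] = p_in
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample each pair once
    return (upper | upper.T).astype(float)

def transition_matrix(adj, restart):
    """Row-stochastic random-walk matrix; isolated vertices jump to `restart`."""
    deg = adj.sum(axis=1)
    return np.where(deg[:, None] > 0, adj / np.maximum(deg, 1.0)[:, None], restart[None, :])

def personalized_pagerank(adj, seed_nodes, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for pi = (1 - alpha) * restart + alpha * pi @ P."""
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[seed_nodes] = 1.0 / len(seed_nodes)
    P = transition_matrix(adj, restart)
    pi = restart.copy()
    for _ in range(max_iter):
        new_pi = (1 - alpha) * restart + alpha * (pi @ P)
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

rng = np.random.default_rng(0)
adj = planted_er_graph(n=200, k=40, p_in=0.5, p_out=0.05, rng=rng)
ppr = personalized_pagerank(adj, seed_nodes=[0, 1, 2], alpha=0.85)

# Compare average scores inside and outside the planted block.
print(ppr[:40].mean(), ppr[40:].mean())
```

For local clustering, one would rank vertices by `ppr` and take the top-scoring ones as the candidate community; rerunning with different values of `alpha` gives a rough feel for how the damping factor affects the separation between the planted block and the rest.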
CV Jul 6, 2017
Automated Lane Detection in Crowds using Proximity Graphs
Stijn Heldens, Claudio Martella, Nelly Litvak et al.
Studying the behavior of crowds is vital for understanding and predicting human interactions in public areas. Research has shown that, under certain conditions, large groups of people can form collective behavior patterns: local interactions between individuals result in global movement patterns. To detect these patterns in a crowd, we assume each person is carrying an on-body device that acts as a local proximity sensor, e.g., a smartphone or Bluetooth badge, and represent the texture of the crowd as a proximity graph. Our goal is to extract information about crowds from these proximity graphs. In this work, we focus on one particular type of pattern: lane formation. We present a formal definition of a lane, propose a simple probabilistic model that simulates lanes moving through a stationary crowd, and present an automated lane-detection method. Our preliminary results show that our method is able to detect lanes of different shapes and sizes. We see our work as an initial step towards rich pattern recognition using proximity graphs.