SI LG Feb 15, 2021

A Hidden Challenge of Link Prediction: Which Pairs to Check?

Caleb Belth, Alican Büyükçakır, Danai Koutra

arXiv:2102.07878v1

Originality: Incremental advance

AI Analysis

The paper addresses a fundamental issue for researchers and practitioners in network analysis: making link prediction efficient and effective in real-world scenarios. The advance is incremental, however, because it builds on existing models such as SBMs and proximity methods.

The paper tackles a practical challenge in link prediction: no predefined test set exists, so methods must search a quadratic, sparse space of node pairs. It introduces LinkWaldo, a framework that selects candidate pairs by combining structural resemblance with proximity. On 13 networks, the resulting candidate sets contain on average 7-33% more missing and future links than the baselines' sets.

The traditional setup of link prediction in networks assumes that a test set of node pairs, which is usually balanced, is available over which to predict the presence of links. However, in practice, there is no test set: the ground-truth is not known, so the number of possible pairs to predict over is quadratic in the number of nodes in the graph. Moreover, because graphs are sparse, most of these possible pairs will not be links. Thus, link prediction methods, which often rely on proximity-preserving embeddings or heuristic notions of node similarity, face a vast search space, with many pairs that are in close proximity, but that should not be linked. To mitigate this issue, we introduce LinkWaldo, a framework for choosing from this quadratic, massively-skewed search space of node pairs, a concise set of candidate pairs that, in addition to being in close proximity, also structurally resemble the observed edges. This allows it to ignore some high-proximity but low-resemblance pairs, and also identify high-resemblance, lower-proximity pairs. Our framework is built on a model that theoretically combines Stochastic Block Models (SBMs) with node proximity models. The block structure of the SBM maps out where in the search space new links are expected to fall, and the proximity identifies the most plausible links within these blocks, using locality sensitive hashing to avoid expensive exhaustive search. LinkWaldo can use any node representation learning or heuristic definition of proximity, and can generate candidate pairs for any link prediction method, allowing the representation power of current and future methods to be realized for link prediction in practice. We evaluate LinkWaldo on 13 networks across multiple domains, and show that on average it returns candidate sets containing 7-33% more missing and future links than both embedding-based and heuristic baselines' sets.
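The idea in the abstract (let block structure decide where in the search space to look, then let proximity rank pairs inside each block, using hashing to avoid scoring every pair) can be illustrated with a small sketch. This is not the authors' implementation. The function names, the pre-assigned `groups` labels, the proportional budget rule, and the random-hyperplane hashing scheme are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_signature(vec, hyperplanes):
    """Sign pattern of vec against each random hyperplane (a hashable bucket key)."""
    return tuple(sum(v * h for v, h in zip(vec, plane)) >= 0 for plane in hyperplanes)


def select_candidates(edges, groups, emb, budget, n_planes=2, seed=0):
    """Hypothetical helper: return up to `budget` unlinked node pairs, spread
    across blocks in proportion to observed edges, ranked by dot product."""
    rng = random.Random(seed)
    dim = len(next(iter(emb.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    observed = {tuple(sorted(e)) for e in edges}

    # 1. Block structure: count observed edges per (group, group) block.
    block_counts = defaultdict(int)
    for u, v in observed:
        block_counts[tuple(sorted((groups[u], groups[v])))] += 1
    total = sum(block_counts.values())

    # 2. Hashing: only pairs that share a bucket are ever scored.
    buckets = defaultdict(list)
    for node, vec in emb.items():
        buckets[lsh_signature(vec, planes)].append(node)

    per_block = defaultdict(list)
    for members in buckets.values():
        for u, v in combinations(sorted(members), 2):
            if (u, v) in observed:
                continue  # already a link, not a candidate
            block = tuple(sorted((groups[u], groups[v])))
            if block in block_counts:  # skip blocks with no observed edges
                score = sum(a * b for a, b in zip(emb[u], emb[v]))
                per_block[block].append((score, (u, v)))

    # 3. Budget each block proportionally; keep its highest-proximity pairs.
    candidates = []
    for block, scored in per_block.items():
        share = round(budget * block_counts[block] / total)
        scored.sort(reverse=True)
        candidates.extend(pair for _, pair in scored[:share])
    return candidates[:budget]


# Toy graph: two groups, three observed edges, 2-D embeddings (all made up).
edges = [(0, 1), (1, 2), (3, 4)]
groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0],
       3: [1.0, 0.0], 4: [0.0, 1.0], 5: [0.9, 0.1]}

# n_planes=0 puts every node in one bucket, i.e. exhaustive scoring.
print(select_candidates(edges, groups, emb, budget=3, n_planes=0))
# → [(0, 2), (3, 5)]
```

In this toy run, cross-group pairs are never proposed because no observed edge falls in the A-B block, and (3, 5) is preferred over (4, 5) because its embeddings are closer. Raising `n_planes` trades recall for speed: fewer pairs share a bucket, so fewer are scored, but some close pairs may be missed.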
