Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation
TopKGraphs gives network analysts and machine learning practitioners a robust, interpretable way to estimate node similarity, especially in sparse, noisy, or heterogeneous networks.
This paper introduces TopKGraphs, a method for estimating node similarity in networks. It uses Jaccard-biased random walks and robust rank aggregation to construct node-to-node affinity matrices. On synthetic graphs, k-nearest-neighbor graphs built from tabular data, and a protein-protein interaction network, TopKGraphs achieved competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec).
Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.
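To make the pipeline described above concrete, the following is a minimal illustrative sketch, not the authors' implementation. It rests on several assumptions that the abstract does not specify: transitions are weighted by the Jaccard similarity between the closed neighborhoods of the current node and each candidate neighbor, a small smoothing constant `eps` keeps every neighbor reachable, each walk yields a partial ranking by order of first visit, and the rankings are combined with a simple mean reciprocal-rank vote that stands in for the robust rank aggregation step. The function names and the parameters `n_walks`, `length`, and `eps` are hypothetical.

```python
import random
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def biased_walk(adj, start, length, rng, eps=1e-3):
    """Return nodes in order of first visit along one walk from `start`."""
    order, seen, current = [], {start}, start
    for _ in range(length):
        nbrs = sorted(adj[current])
        if not nbrs:
            break
        # Assumption: compare closed neighborhoods (node included), so that
        # adjacent nodes in a tight cluster score high.
        cur_nb = adj[current] | {current}
        weights = [jaccard(cur_nb, adj[v] | {v}) + eps for v in nbrs]
        current = rng.choices(nbrs, weights=weights, k=1)[0]
        if current not in seen:
            seen.add(current)
            order.append(current)
    return order


def affinity_row(adj, start, n_walks=200, length=5, seed=0):
    """Aggregate partial rankings from many walks into scores for `start`."""
    rng = random.Random(seed)
    scores = defaultdict(float)
    for _ in range(n_walks):
        walk = biased_walk(adj, start, length, rng)
        for rank, node in enumerate(walk, start=1):
            # Stand-in aggregation: reciprocal-rank vote, averaged over walks.
            scores[node] += 1.0 / rank
    return {node: s / n_walks for node, s in scores.items()}


# Toy graph: two triangles joined by a single bridge edge (2-3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
row = affinity_row(adj, start=0)
top = sorted(row, key=row.get, reverse=True)
```

Calling `affinity_row` once per node and stacking the results would give one row of an affinity matrix per start node. In this toy graph the walks from node 0 mostly stay inside its own triangle, because the bridge edge connects nodes whose neighborhoods overlap little. That illustrates the idea of treating walks as stochastic neighborhood samplers rather than computing a stationary distribution.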