DS May 8

Estimating Correlation Clustering Cost in Node-Arrival Stream

Kaiwen Liu, Seba Daniela Villalobos, Qin Zhang

arXiv:2605.07091 82.9

AI Analysis

This work addresses the practical need to cluster massive data streams in which nodes arrive sequentially, offering a space-efficient approximation of the correlation clustering cost, complemented by lower bounds.

The paper tackles correlation clustering in the node-arrival stream model, where the stream contains only nodes and the similar/dissimilar edges are derived via a similarity function. The proposed algorithm, C$^4$Approx, approximates the clustering cost using space sublinear in the number of nodes and a constant number of passes. In experiments on real-world datasets, it achieves performance comparable to Pivot and PrunedPivot while storing only 2% of the nodes.

We study the correlation clustering problem in the node-arrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets), and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
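
To make the objective concrete, the following is a minimal sketch of the classic Pivot baseline and of the correlation clustering cost (the number of pairwise disagreements) when edges are derived from a similarity function. It is not C$^4$Approx, whose details are not given in the abstract: this sketch stores every node and takes quadratic time. The names `pivot_clustering` and `clustering_cost`, the toy points, and the distance threshold are all assumptions chosen for illustration.

```python
import random
from itertools import combinations


def pivot_clustering(nodes, similar, rng):
    """Classic Pivot: visit nodes in random order; each still-unclustered
    node becomes a pivot and absorbs its unclustered similar neighbors."""
    order = list(nodes)
    rng.shuffle(order)
    label = {}
    for pivot in order:
        if pivot in label:
            continue  # already absorbed by an earlier pivot
        label[pivot] = pivot
        for v in order:
            if v not in label and similar(pivot, v):
                label[v] = pivot
    return label


def clustering_cost(nodes, similar, label):
    """Count disagreements: similar pairs placed in different clusters
    plus dissimilar pairs placed in the same cluster."""
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = label[u] == label[v]
        if similar(u, v) != same_cluster:
            cost += 1
    return cost


# Hypothetical toy data: points on a line, "similar" if within distance 1.0.
points = [0.0, 0.2, 0.4, 5.0, 5.1, 9.0]


def similar(i, j):
    return abs(points[i] - points[j]) <= 1.0


nodes = range(len(points))
label = pivot_clustering(nodes, similar, random.Random(0))
cost = clustering_cost(nodes, similar, label)
print(label, cost)
```

In this toy instance the similar pairs form three disjoint cliques, so Pivot recovers them with zero disagreements for any random order. In the node-arrival setting described above, the difficulty is estimating this cost without keeping all nodes in memory.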
