Rk-means: Fast Clustering for Relational Data
This addresses the bottleneck of high computation costs for clustering in relational databases, offering a practical solution for data analysts and database users, though it is incremental as it builds on existing k-means methods.
The paper tackles the problem of clustering relational data without constructing the full data matrix, which is computationally expensive, by introducing Rk-means, a relational k-means algorithm that uses a grid coreset to achieve constant approximation for the k-means objective, resulting in orders-of-magnitude speedup and faster runtime than computing the data matrix alone.
Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k -means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small {\it grid coreset} of the data matrix for subsequent cluster construction. This gives a constant approximation for the k -means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.