Alfredo Oneto, Blazhe Gjorgiev, Giovanni Sansavini
Many data clustering applications must handle objects that cannot be represented as vectors. In this context, the bag-of-vectors representation describes complex objects through discrete distributions, for which the Wasserstein distance provides a well-conditioned dissimilarity measure. Kernel methods build on such dissimilarities by embedding distance information into feature spaces that facilitate analysis. However, an unsupervised framework that combines kernels with Wasserstein distances for clustering distributional data is still lacking. We address this gap by introducing a computationally tractable framework that integrates Wasserstein metrics with kernel methods for clustering. The framework accommodates both vectorial and distributional data, which makes it applicable across domains. It comprises three components: (i) an efficient approximation of pairwise Wasserstein distances using multiple reference distributions; (ii) shifted positive definite kernel functions based on Wasserstein distances, combined with kernel principal component analysis for feature mapping; and (iii) scalable, distance-agnostic validity indices for clustering evaluation and kernel parameter optimization. Experiments on power distribution graphs and real-world time series demonstrate the effectiveness and efficiency of the proposed framework.
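As a rough illustration of how components (i) and (ii) fit together, the following minimal Python sketch builds a Wasserstein-based kernel and extracts kernel PCA features. It is not the implementation proposed here. It rests on several assumptions: the bags are one-dimensional samples, pairwise distances are computed exactly with `scipy.stats.wasserstein_distance` instead of being approximated through reference distributions, and the "shifted" kernel is taken to be an exponential kernel whose diagonal is shifted by its most negative eigenvalue. The function names `pairwise_wasserstein`, `shifted_kernel`, and `kernel_pca`, as well as the parameter `gamma`, are hypothetical. Component (iii), the validity indices, is omitted.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def pairwise_wasserstein(bags):
    """Exact pairwise 1-D Wasserstein-1 distances between sample bags.

    Assumption: exact distances stand in for the reference-based approximation.
    """
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = wasserstein_distance(bags[i], bags[j])
    return dist


def shifted_kernel(dist, gamma=1.0):
    """Exponential kernel on distances, diagonally shifted if it is not PSD.

    Assumption: this is one possible reading of a "shifted" kernel.
    """
    kernel = np.exp(-gamma * dist)
    min_eig = np.linalg.eigvalsh(kernel).min()
    if min_eig < 0:
        # Adding |min_eig| to the diagonal raises every eigenvalue by that amount.
        kernel = kernel + (-min_eig) * np.eye(len(kernel))
    return kernel


def kernel_pca(kernel, n_components=2):
    """Kernel PCA on a precomputed kernel matrix via eigendecomposition."""
    n = len(kernel)
    centering = np.eye(n) - np.ones((n, n)) / n
    centered = centering @ kernel @ centering  # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(centered)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against round-off
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy data: two groups of 1-D sample bags with different means.
rng = np.random.default_rng(0)
bags = [rng.normal(0.0, 1.0, 50) for _ in range(5)]
bags += [rng.normal(5.0, 1.0, 50) for _ in range(5)]

dist = pairwise_wasserstein(bags)
kernel = shifted_kernel(dist, gamma=0.5)
features = kernel_pca(kernel, n_components=2)
```

The rows of `features` are ordinary vectors, so any standard clustering algorithm can be applied to them. In this toy setting the first component alone should separate the two groups of bags.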