ML · LG · Apr 28, 2025

Coreset selection for the Sinkhorn divergence and generic smooth divergences

arXiv:2504.20194v2 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

The paper provides an efficient method for data reduction in machine learning applications, though the advance appears incremental because it builds on existing coreset and kernel quadrature frameworks.

The paper tackles coreset selection for smooth divergences by introducing CO2, an algorithm that reduces the problem to maximum mean discrepancy minimization via a functional Taylor expansion. For the Sinkhorn divergence, it yields a sampling procedure that needs only poly-logarithmically many data points to match the approximation guarantees of random sampling, with an application to subsampling image data.

We introduce CO2, an efficient algorithm to produce convexly-weighted coresets with respect to generic smooth divergences. By employing a functional Taylor expansion, we show a local equivalence between sufficiently regular losses and their second order approximations, reducing the coreset selection problem to maximum mean discrepancy minimization. We apply CO2 to the Sinkhorn divergence, providing a novel sampling procedure that requires poly-logarithmically many data points to match the approximation guarantees of random sampling. To show this, we additionally verify several new regularity properties for entropically regularized optimal transport of independent interest. Our approach leads to a new perspective linking coreset selection and kernel quadrature to classical statistical methods such as moment and score matching. We showcase this method with a practical application of subsampling image data, and highlight key directions to explore for improved algorithmic efficiency and theoretical guarantees.
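
To make the reduction to maximum mean discrepancy (MMD) minimization concrete, here is a minimal sketch of how a convexly-weighted coreset could be built once a kernel is fixed. This is not the CO2 algorithm itself: it swaps in a generic Frank-Wolfe (kernel-herding-style) solver, and it assumes a Gaussian kernel purely for illustration, whereas the abstract indicates the relevant kernel comes from the second-order expansion of the divergence. The function names `gaussian_kernel`, `mmd_squared`, and `frank_wolfe_coreset` are hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y.

    Assumption: the Gaussian kernel is an illustrative stand-in only.
    """
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(K, w, target):
    """Squared MMD between two weight vectors over the same point set."""
    diff = w - target
    return float(diff @ K @ diff)


def frank_wolfe_coreset(K, n_points):
    """Minimize (w - u)^T K (w - u) over the probability simplex.

    u is the uniform weight vector on the full dataset. Each Frank-Wolfe
    step moves toward a single vertex, so the returned weights are
    non-negative, sum to one, and have at most n_points non-zero entries.
    """
    n = K.shape[0]
    u = np.full(n, 1.0 / n)
    Ku = K @ u

    # Start from the point best aligned with the kernel mean embedding.
    w = np.zeros(n)
    w[int(np.argmax(Ku))] = 1.0

    for _ in range(n_points - 1):
        Kw = K @ w
        r = Kw - Ku                      # half the gradient of the objective
        i = int(np.argmin(r))            # Frank-Wolfe vertex
        # Exact line search along d = e_i - w, clipped to [0, 1].
        num = r[i] - w @ r               # d^T K (w - u), always <= 0
        den = K[i, i] - 2.0 * Kw[i] + w @ Kw   # d^T K d
        gamma = 0.0 if den <= 0.0 else float(np.clip(-num / den, 0.0, 1.0))
        w = (1.0 - gamma) * w
        w[i] += gamma
    return w


# Usage: compress 200 synthetic 2-D points to at most 20 weighted points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
K = gaussian_kernel(X, X)
w = frank_wolfe_coreset(K, n_points=20)
coreset_indices = np.flatnonzero(w)
```

The non-zero entries of `w` give the selected points and their convex weights. Because each step uses an exact, clipped line search, the squared MMD to the uniform empirical measure never increases from one iteration to the next.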

Code Implementations: 1 repo
