LGDec 22, 2025

From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples

arXiv:2512.19363v18 citationsh-index: 2
Originality Highly original
AI Analysis

This addresses the need for scalable data valuation methods in machine learning, offering a novel approach that is not incremental but provides significant practical improvements.

The paper tackles the problem of efficiently quantifying the value of training examples in large, heterogeneous datasets by proposing Hierarchical Contrastive Data Valuation (HCDV), which reduces computational complexity from factorial to near-linear time and improves accuracy by up to 5 percentage points in experiments.

How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its O(n!) complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to O(T sum_{l} K_{l}) = O(T K_max log n), rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss O(eta log n), enjoys sub-Gaussian coalition deviation tilde O(1/sqrt{T}), and incurs at most k epsilon_infty regret for top-k selection. Experiments on four benchmarks--tabular, vision, streaming, and a 45M-sample CTR task--plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes