AI · May 11

Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

arXiv:2605.1131251.1
Predicted impact: top 69% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers working on data attribution and pruning, this work addresses a specific limitation of existing methods in low-data regimes, offering an incremental improvement.

The paper shows that Shapley-based data values are suboptimal for pruning low-value data when only a small amount of data remains. It proposes Constraint-Data-Value-Maximization (CDVM), which formulates pruning as a constrained optimization that maximizes total influence while penalizing excessive per-test contributions. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model's performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
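The intuition behind penalizing excessive per-test contributions can be illustrated with a toy example. The sketch below is not the paper's formulation, which the abstract does not spell out. It assumes a hypothetical attribution matrix in which entry `[i][j]` is the contribution of training point `i` to test point `j`, and it swaps in a simple greedy heuristic with a per-test cap (`cap`, an illustrative parameter) in place of a proper constrained solver. All function names are made up for illustration.

```python
# Toy attribution matrix (hypothetical values): rows are training points,
# columns are test points. Points 0 and 1 both help only test point 0;
# point 2 is the only one that helps test point 1.
ATTRIBUTIONS = [
    [1.0, 0.0],
    [0.9, 0.0],
    [0.0, 0.6],
]


def keep_top_by_total(attributions, k):
    """Baseline: keep the k training points with the largest summed attribution."""
    totals = [sum(row) for row in attributions]
    ranked = sorted(range(len(attributions)), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:k])


def keep_greedy_capped(attributions, k, cap):
    """Illustrative heuristic (an assumption, not CDVM itself): greedily keep
    k points to maximize total attribution, where the credit any single test
    point can receive is capped at `cap`."""
    n_test = len(attributions[0])
    covered = [0.0] * n_test  # attribution accumulated per test point
    kept = []
    remaining = set(range(len(attributions)))

    def objective(cov):
        # Total influence, with each test point's contribution capped.
        return sum(min(c, cap) for c in cov)

    for _ in range(k):
        base = objective(covered)
        best, best_gain = None, float("-inf")
        for i in sorted(remaining):
            candidate = [c + a for c, a in zip(covered, attributions[i])]
            gain = objective(candidate) - base
            if gain > best_gain:
                best, best_gain = i, gain
        kept.append(best)
        remaining.remove(best)
        covered = [c + a for c, a in zip(covered, attributions[best])]
    return sorted(kept)


baseline_kept = keep_top_by_total(ATTRIBUTIONS, k=2)
capped_kept = keep_greedy_capped(ATTRIBUTIONS, k=2, cap=1.0)
```

In this toy setup, ranking by summed value keeps points 0 and 1, which both serve the same test point and leave the other with no support. The capped variant keeps points 0 and 2 instead, because further credit toward an already saturated test point earns nothing. This is the kind of failure that can matter when only a small fraction of the data is retained.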

