AI · May 11

Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

Danilo Brajovic, David A. Kreplin, Marco F. Huber

arXiv:2605.11312 · 51.1

Predicted impact: top 69% in AI · last 90 days · Originality: Incremental advance

AI Analysis

For researchers working on data attribution and pruning, this work addresses a specific limitation of existing methods in low-data regimes, offering an incremental improvement.

The paper identifies that Shapley-based data values are suboptimal for pruning low-value data in low-data environments. It proposes Constraint-Data-Value-Maximization (CDVM), which formulates pruning as a constrained optimization to maximize total influence while penalizing per-test contributions, achieving strong performance on the OpenDataVal benchmark.

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which eliminates data instances with either low or high values and then assesses the performance of a model trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
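The abstract does not give the CDVM objective, so the following is only a toy illustration of the general idea of maximizing total influence while limiting how much any single test point can contribute. It is not the authors' formulation. The attribution matrix, the hard cap used as the penalty, the greedy solver, and the function names `capped_greedy_selection` and `top_k_by_total` are all assumptions made for this sketch.

```python
# Toy sketch, NOT the paper's CDVM formulation: the capped objective,
# the greedy solver, and all names here are assumptions for illustration.

def capped_greedy_selection(attributions, k, cap):
    """Greedily pick k training indices that maximize
    sum over test points j of min(cap, sum of attributions[i][j] for selected i).

    attributions[i][j] is the (assumed non-negative) contribution of
    training point i to test point j.
    """
    n_train = len(attributions)
    n_test = len(attributions[0])
    selected = []
    covered = [0.0] * n_test          # accumulated attribution per test point
    remaining = set(range(n_train))

    def objective(totals):
        # Contributions to a single test point stop counting beyond `cap`.
        return sum(min(cap, t) for t in totals)

    for _ in range(min(k, n_train)):
        base = objective(covered)
        best_i, best_gain = None, float("-inf")
        for i in sorted(remaining):   # sorted for deterministic tie-breaking
            candidate = [c + a for c, a in zip(covered, attributions[i])]
            gain = objective(candidate) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        remaining.remove(best_i)
        covered = [c + a for c, a in zip(covered, attributions[best_i])]
    return selected


def top_k_by_total(attributions, k):
    """Baseline: keep the k training points with the largest summed attribution."""
    totals = [sum(row) for row in attributions]
    return sorted(range(len(attributions)), key=lambda i: -totals[i])[:k]


# Made-up attribution matrix: 4 training points (rows), 3 test points (columns).
attributions = [
    [0.9, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.4],
]
```

On this made-up matrix, `top_k_by_total(attributions, 2)` keeps training points 0 and 1, which both support only the first test point. `capped_greedy_selection(attributions, 2, cap=1.0)` keeps points 0 and 2, so a second test point is also covered. This is one way a per-test limit can matter when very little data is retained.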
