AIFeb 6, 2025

Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making

Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma

arXiv:2502.04554v15.81 citationsh-index: 4

Originality Incremental advance

AI Analysis

This work addresses data selection for machine learning practitioners, offering a theoretical framework and efficient method, though it builds incrementally on existing data valuation approaches.

The paper tackles the problem of data selection by reformulating data valuation as a sequential-decision-making problem, showing that optimal data values can be derived via dynamic programming and unifying existing methods like Data Shapley as approximations. It proposes an efficient approximation scheme using learned bipartite graphs as surrogate utility models, with experiments demonstrating effectiveness across diverse datasets.

Data selection has emerged as a crucial downstream application of data valuation. While existing data valuation methods have shown promise in selection tasks, the theoretical foundations and full potential of using data values for selection remain largely unexplored. In this work, we first demonstrate that data values applied for selection can be naturally reformulated as a sequential-decision-making problem, where the optimal data value can be derived through dynamic programming. We show this framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, specifically as myopic reward function approximations to this sequential problem. Furthermore, we analyze how sequential data selection optimality is affected when the ground-truth utility function exhibits monotonic submodularity with curvature. To address the computational challenges in obtaining optimal data values, we propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models, ensuring greedy selection is still optimal when the surrogate utility is correctly specified and learned. Extensive experiments demonstrate the effectiveness of our approach across diverse datasets.

View on arXiv PDF

Similar