LG | Jun 18, 2023

2D-Shapley: A Framework for Fragmented Data Valuation

Zhihong Liu, Hoang Anh Just, Xiangyu Chang, Xi Chen, Ruoxi Jia

arXiv:2306.10473v212.314 citations | h-index: 145 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a critical gap in data valuation for fragmented data, a problem that matters for making machine learning more transparent and for designing incentive systems for data sharing. It is, however, an incremental advance that builds on existing Shapley-based methods.

The paper tackles the problem of valuing fragmented data sources, where each source contains only partial features and samples, by proposing 2D-Shapley, a theoretical framework that satisfies key axioms and enables new use cases like selecting useful fragments and diagnosing data issues.

Data valuation -- quantifying the contribution of individual data sources to certain predictive behaviors of a model -- is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources that share a feature or sample space. How to value fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
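
For readers unfamiliar with the Shapley-based valuation that the abstract builds on, here is a minimal sketch of the classic (one-dimensional) Shapley value, in which each data source is a "player" and its value is its average marginal contribution over all orderings. This is not the paper's 2D-Shapley method: the function names, the three sources `A`, `B`, `C`, and the toy utility are all hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering of the players. Only feasible for a few players."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Counterfactual: utility with p minus utility without p.
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: v / n_orders for p, v in values.items()}


def toy_utility(coalition):
    """Hypothetical utility (stand-in for model performance): each source
    adds a fixed score, and A and B together earn a small synergy bonus."""
    base = {"A": 0.5, "B": 0.3, "C": 0.0}
    score = sum(base[p] for p in coalition)
    if {"A", "B"} <= coalition:
        score += 0.1  # assumed synergy, split evenly by the Shapley value
    return score


values = shapley_values(["A", "B", "C"], toy_utility)
print(values)
```

In this toy setting the values sum to the utility of the full set of sources, and the source that never changes the utility (`C`) receives a value of zero, two of the axioms that Shapley-based valuation is usually motivated by. In practice the utility would be a trained model's performance, and exact enumeration is replaced by sampling because the number of orderings grows factorially.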
