LG · Jan 3, 2023

Data Valuation Without Training of a Model

arXiv:2301.00930v2 · 44 citations · h-index: 10 · Has Code
AI Analysis

This paper addresses the high computational cost of existing data valuation methods, a concern for machine learning researchers and practitioners. The contribution is incremental, since it builds on prior work.

The paper tackles the problem of quantifying individual data instance influence in deep learning without model training, proposing a training-free complexity-gap score that effectively identifies irregular or mislabeled instances in two-layer overparameterized neural networks.

Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model. Such attempts reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actually training a model, which often incurs a high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, which is a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. Our code is publicly available at https://github.com/JJchy/CG_score
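
The abstract does not state the formula, so the following is only a minimal sketch of what a training-free, leave-one-out score of this kind could look like. It assumes the score is the drop in a kernel complexity term y^T H^{-1} y when one instance is removed, with H taken to be the infinite-width Gram matrix of a two-layer ReLU network on unit-norm inputs. The function names (`relu_gram`, `complexity_term`, `cg_scores`) and the small ridge term are illustrative choices and are not taken from the authors' repository.

```python
import numpy as np

def relu_gram(X):
    """Assumed kernel: infinite-width two-layer ReLU Gram matrix on unit-norm rows."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(X @ X.T, -1.0, 1.0)  # cosine similarities, clipped for arccos
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def complexity_term(H, y, ridge=1e-8):
    """Compute y^T (H + ridge*I)^{-1} y; the ridge is only for numerical stability."""
    n = len(y)
    return float(y @ np.linalg.solve(H + ridge * np.eye(n), y))

def cg_scores(X, y, ridge=1e-8):
    """Leave-one-out gap in the complexity term for every instance (no training)."""
    H = relu_gram(X)
    full = complexity_term(H, y, ridge)
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops instance i
        H_minus_i = H[np.ix_(keep, keep)]
        scores[i] = full - complexity_term(H_minus_i, y[keep], ridge)
    return scores
```

Under this reading, instances with larger scores would be the candidates for irregular or mislabeled data. The explicit loop costs one linear solve per instance; it is written for clarity rather than speed.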

Code Implementations: 1 repo
