LG AI · Oct 27, 2025

Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach

Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song

arXiv:2510.23409v2 · h-index: 2

AI Analysis

This work addresses the need for efficient and robust data valuation in data-centric AI, particularly for large-scale settings with domain shift, though it appears incremental as it builds on existing ID loss-based methods.

The paper addresses two weaknesses of existing data valuation methods: they rely on in-distribution (ID) validation and therefore fail to generalize to out-of-distribution (OOD) scenarios, and the OOD-aware alternatives are computationally expensive. It introduces Eigen-Value (EV), a plug-and-play framework that uses a spectral approximation to improve OOD robustness and efficiency, and reports stable value rankings across real-world datasets.

Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each data point. Most existing data valuation methods estimate the effect of removing individual data points by evaluating changes in model validation performance under in-distribution (ID) settings, as opposed to out-of-distribution (OOD) scenarios where data follow different patterns. Since ID and OOD data behave differently, data valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, they involve heavy computational costs, which hinder practical deployment. To address these challenges, we introduce Eigen-Value (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of domain discrepancy, the gap in loss between ID and OOD data, using ratios of eigenvalues of the ID data's covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. Subsequently, EV plugs into ID loss-based methods by adding an EV term without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
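
To make the perturbation-theory step more concrete, the sketch below applies standard first-order eigenvalue perturbation to a leave-one-out covariance update. It is not the authors' implementation: the abstract does not specify which eigenvalue ratio EV uses, so the largest-to-smallest ratio, the function names `loo_eigenvalue_shifts` and `loo_ratio_shifts`, and the choice to hold the data mean fixed when a point is removed are all assumptions made for illustration.

```python
import numpy as np


def loo_eigenvalue_shifts(X):
    """First-order estimate of each covariance eigenvalue's change when one
    row of X is removed. Assumption: the mean is held fixed, so that
    C_{-k} - C = (C - x_k x_k^T) / (n - 1)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / n
    lam, V = np.linalg.eigh(C)        # ascending eigenvalues, eigenvectors in columns
    proj_sq = (Xc @ V) ** 2           # entry (k, i) is (v_i^T x_k)^2
    # First-order perturbation: d_lambda_i = v_i^T (C_{-k} - C) v_i
    dlam = (lam[None, :] - proj_sq) / (n - 1)
    return lam, dlam


def loo_ratio_shifts(X):
    """Hypothetical stand-in score: first-order change in lambda_max / lambda_min
    per removed point. Requires lambda_min > 0 (non-degenerate covariance)."""
    lam, dlam = loo_eigenvalue_shifts(X)
    ratio = lam[-1] / lam[0]
    d_ratio = dlam[:, -1] / lam[0] - lam[-1] * dlam[:, 0] / lam[0] ** 2
    return ratio, d_ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
    ratio, d_ratio = loo_ratio_shifts(X)
    # Indices of the points whose removal is estimated to shrink the ratio most
    print(ratio, np.argsort(d_ratio)[:5])
```

The appeal of this kind of estimate is that a single eigendecomposition serves all n points, rather than one decomposition (or one retraining run) per removed point, which is consistent with the abstract's claim of avoiding an additional training loop.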
