LG DSMay 20

Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification

Guangyi Zhang, Lutz Oettershagen, Lixu Wang, Aristides Gionis

arXiv:2605.2103317.0

Predicted impact top 60% in LG · last 90 daysOriginality Synthesis-oriented

AI Analysis

For machine learning practitioners needing fair data valuation, this work provides the first efficient exact algorithms for Banzhaf values in kNN, though the contribution is incremental as it adapts existing game-theoretic methods to a specific classifier.

The paper tackles the computational challenge of computing Banzhaf values for data valuation in k-nearest neighbor classifiers, proving #P-hardness but developing a dynamic programming algorithm achieving O(nk^2) time for unweighted kNN and O(Wkn^2) for weighted kNN, with experiments showing practical efficiency.

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.

View on arXiv PDF

Similar