LGMLJan 12, 2025

Tab-Shapley: Identifying Top-k Tabular Data Quality Insights

arXiv:2501.06685v11 citationsh-index: 14AAAI
Originality Incremental advance
AI Analysis

This addresses the challenge of detecting anomalies in tabular data for users dealing with unlabeled, complex datasets, though it is incremental as it builds on cooperative game theory.

The paper tackles the problem of identifying top-k data quality insights in tabular datasets by introducing Tab-Shapley, an unsupervised method that uses Shapley values to quantify attribute contributions to anomalies, and validates it with empirical analysis on real-world datasets.

We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes