LG AI · Mar 8, 2024

"What is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts

Varun Babbar, Zhicheng Guo, Cynthia Rudin

arXiv:2403.05652v39.26 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The work targets machine learning practitioners who need actionable insights to understand and mitigate distribution shifts. It is an incremental advance because it builds on existing shift-detection techniques.

The paper tackles the problem of explaining distribution shifts between datasets in a human-understandable way. It proposes a versatile framework and demonstrates its effectiveness across diverse data modalities, such as tabular data, text, images, and time series, in both low- and high-dimensional settings.

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.
