LG AI | Oct 24, 2022

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar

Cambridge, Oxford

arXiv:2210.13043v118.538 citations | h-index: 74 | Has Code

Originality: Incremental advance

AI Analysis

Data-IQ addresses the reliability of predictions in domains such as healthcare, where examples with similar features can have different outcomes, and it offers a framework that can be used with any ML model.

The paper tackles the problem of models underperforming on subgroups because of outcome heterogeneity in tabular data. It proposes Data-IQ, which stratifies examples into Easy, Ambiguous, and Hard subgroups based on predictive confidence and aleatoric uncertainty. The authors show on four medical datasets that this characterization is robust across models, and they apply it to feature acquisition and dataset selection.

High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity. This is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that, compared to baselines, Data-IQ's characterization of examples is the most robust to variation across similarly performant (yet different) models. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting, etc.), this property ensures consistency of data characterization while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
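The stratification described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code. It assumes that predictive confidence is the mean probability the model assigns to the true label across training checkpoints, and that aleatoric uncertainty is the mean of p(1 - p) over those checkpoints. The function name `characterize`, the thresholds `conf_upper` and `conf_lower`, and the median default for `aleatoric_cut` are hypothetical choices made for illustration.

```python
import numpy as np


def characterize(probs_true, conf_upper=0.75, conf_lower=0.25, aleatoric_cut=None):
    """Stratify examples into Easy / Ambiguous / Hard (illustrative sketch).

    probs_true: array of shape (n_checkpoints, n_examples) holding the
    probability assigned to each example's true label at each checkpoint.
    The threshold values are assumptions, not values taken from the paper.
    """
    probs_true = np.asarray(probs_true, dtype=float)

    # Assumed definition: confidence is the mean true-label probability.
    confidence = probs_true.mean(axis=0)
    # Assumed definition: aleatoric uncertainty is the mean of p * (1 - p).
    aleatoric = (probs_true * (1.0 - probs_true)).mean(axis=0)

    # Default cut-off: the median aleatoric uncertainty (an assumption).
    if aleatoric_cut is None:
        aleatoric_cut = np.median(aleatoric)
    low_uncertainty = aleatoric <= aleatoric_cut

    # Everything is Ambiguous unless it is confidently right or wrong
    # while also having low aleatoric uncertainty.
    groups = np.full(confidence.shape, "Ambiguous", dtype=object)
    groups[(confidence >= conf_upper) & low_uncertainty] = "Easy"
    groups[(confidence <= conf_lower) & low_uncertainty] = "Hard"
    return confidence, aleatoric, groups


# Toy input: 3 checkpoints (rows) for 3 examples (columns).
toy_probs = [
    [0.90, 0.50, 0.10],
    [0.95, 0.55, 0.05],
    [0.99, 0.45, 0.08],
]
conf, alea, groups = characterize(toy_probs, aleatoric_cut=0.15)
print(groups.tolist())  # ['Easy', 'Ambiguous', 'Hard']
```

Because the sketch only needs per-checkpoint probabilities for the true label, it is agnostic to the model that produced them, which matches the abstract's point that the characterization can be used with any ML model that is trained iteratively.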

