LG · ML · Nov 19, 2022

Evaluating Synthetic Tabular Data Generated To Augment Small Sample Datasets

arXiv:2211.10760v5 · 3 citations · h-index: 2
Originality Synthesis-oriented
AI Analysis

This paper addresses the challenge of reliably validating synthetic data for machine learning applications with limited samples. Its contribution is incremental: it proposes multi-faceted evaluation rather than a new solution.

The paper tackles the problem of evaluating synthetic tabular data for augmenting small datasets, finding that common metrics like propensity scoring and MMD often fail to detect topological differences, while topological measures like normalized Bottleneck distance show high variability and instability.
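To make the propensity-based metric concrete, here is a minimal sketch of one common way to summarize propensity scores, the propensity mean squared error (pMSE). This is an assumption about the form of the metric, not necessarily the exact variant used in the paper. The function name `pmse` and its inputs are hypothetical: the scores are assumed to come from any classifier trained to distinguish real from synthetic rows in the combined dataset.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity mean squared error (illustrative sketch).

    propensity_scores: predicted probability that each row of the combined
        (real + synthetic) dataset is synthetic, from any classifier.
    n_synthetic: number of synthetic rows in the combined dataset.
    """
    n_total = len(propensity_scores)
    if n_total == 0:
        raise ValueError("propensity_scores must not be empty")
    # Share of synthetic rows: the score an uninformative classifier
    # would assign to every row.
    c = n_synthetic / n_total
    # Mean squared deviation of the scores from that baseline.
    return sum((p - c) ** 2 for p in propensity_scores) / n_total
```

A value near 0 means the classifier cannot tell real from synthetic rows. With very few samples the classifier itself is poorly estimated, which is one plausible reason such a score can suggest similarity that structural measures do not confirm.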

This work proposes a method to evaluate synthetic tabular data generated to augment small sample datasets. While data augmentation techniques can increase sample counts for machine learning applications, traditional validation approaches fail when applied to extremely limited sample sizes. Our experiments across four datasets reveal significant inconsistencies between global metrics and topological measures, with statistical tests producing unreliable significance values due to insufficient sample sizes. We demonstrate that common metrics like propensity scoring and MMD often suggest similarity where fundamental topological differences exist. Our proposed metric, based on the normalized Bottleneck distance, provides complementary insights but suffers from high variability across experimental runs and occasionally produces values exceeding its theoretical bounds, which points to inherent instability in topological approaches for very small datasets. These findings highlight the critical need for multi-faceted evaluation methodologies when validating synthetic data generated from limited samples, as no single metric reliably captures both distributional and structural similarity.
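As an illustration of the kind of global metric discussed above, the following is a minimal sketch of a squared MMD (maximum mean discrepancy) estimate between a real and a synthetic sample. It assumes an RBF kernel with a hand-picked bandwidth parameter `gamma` and uses the simple biased estimator. The function names and the default `gamma` are illustrative choices, not taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) between two numeric rows."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(real, synthetic, gamma=1.0):
    """Biased estimate of squared MMD between two lists of numeric rows.

    gamma is an assumed kernel bandwidth; in practice it is often set
    with a heuristic such as the median pairwise distance.
    """
    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(x, y, gamma) for x in a for y in b)
        return total / (len(a) * len(b))

    return (
        mean_kernel(real, real)
        + mean_kernel(synthetic, synthetic)
        - 2 * mean_kernel(real, synthetic)
    )


if __name__ == "__main__":
    real = [[0.0, 0.0], [1.0, 0.0]]
    shifted = [[5.0, 5.0], [6.0, 5.0]]
    print(mmd_squared(real, real))     # identical samples
    print(mmd_squared(real, shifted))  # clearly separated samples
```

Because MMD compares averaged kernel similarities, two samples can score as close while differing in structure such as holes or connected components, which is the kind of mismatch with topological measures that the abstract reports.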

