CV, AI, LG | Sep 29, 2023

Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

arXiv:2310.00158v2 | 19.337 citations | h-index: 27 | Has Code

Originality: Highly original

AI Analysis

This work addresses data imbalance issues in machine learning by enhancing synthetic data generation, offering a novel approach for improving classification performance in domains with long-tailed or group-imbalanced datasets.

The paper tackles the problem of imbalanced classification by introducing a feedback-guided data synthesis framework that uses classifier feedback to generate useful synthetic samples, achieving state-of-the-art results with over 4% improvement on underrepresented classes in ImageNet-LT and over 5% boost in worst group accuracy in NICO++.

The current status quo in machine learning is to train on static datasets of real images, which often come from long-tailed distributions. With recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these gains are limited by the lack of feedback from the classifier to the generative model, feedback that would make the generated samples more useful for improving the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. For the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. On NICO++, worst-group accuracy improves by over 5 percent. With these results, our framework paves the way towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.
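To make the idea of classifier feedback concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the three feedback criteria, so predictive entropy is used here purely as an assumed stand-in, and the sketch ranks already-generated candidates instead of steering the generative model's sampling as the framework does. The function names (`softmax`, `entropy`, `select_by_feedback`) and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_feedback(candidate_logits, k):
    """Return indices of the k candidates the classifier is least certain about.

    Assumption: high predictive entropy is used as the 'usefulness' signal.
    """
    scores = [entropy(softmax(logits)) for logits in candidate_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical classifier logits for three synthetic candidates.
candidates = [
    [10.0, 0.0, 0.0],  # confident prediction, low entropy
    [1.0, 1.0, 1.0],   # uniform prediction, maximal entropy
    [2.0, 1.0, 0.0],   # moderately uncertain
]
selected = select_by_feedback(candidates, k=2)
```

In this toy setting the two uncertain candidates are kept and the confidently classified one is dropped. A selection rule like this captures only the "feedback" half of the idea; the abstract's additional requirements (samples close to the support of the real data, and sufficiently diverse) are not modeled here.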
