CV, LG · May 27, 2025

Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models

arXiv:2505.21574v2 · 1 citation · h-index: 29
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reducing computational costs in synthetic data augmentation for image classification, offering a more efficient alternative to existing methods.

The paper tackles the problem of inefficient synthetic data augmentation by showing that augmenting only 30%-40% of the data, specifically parts not learned early in training, improves generalization by up to 2.8% across various image classifiers and datasets.

Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting the part of the data that is not learned early in training with faithful images (containing the same features but different noise) outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100 and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
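
The selection step described in the abstract can be illustrated with a small sketch. The abstract does not state how "not learned early in training" is measured, so the code below assumes a per-example loss recorded after a few early epochs and treats the highest-loss examples as the ones to augment. The function name `select_not_learned_early` and the `fraction` parameter are hypothetical, and the diffusion-based generation step is omitted entirely.

```python
def select_not_learned_early(early_losses, fraction=0.3):
    """Return sorted indices of the examples with the highest early-training loss.

    Assumption (not from the paper): a high loss after a few early epochs is
    used as a proxy for "not learned early in training".
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = len(early_losses)
    if n == 0:
        return []
    # Number of examples to augment, e.g. 30%-40% of the dataset.
    k = max(1, round(fraction * n))
    # Rank example indices from highest to lowest early loss.
    ranked = sorted(range(n), key=lambda i: early_losses[i], reverse=True)
    return sorted(ranked[:k])


# Toy per-example losses for a 10-example dataset (illustrative values only).
losses = [0.1, 2.0, 0.05, 1.5, 0.3, 0.9, 0.02, 0.4, 0.2, 1.1]
print(select_not_learned_early(losses, fraction=0.3))  # [1, 3, 9]
```

Only the returned indices would then be passed to a generator to produce faithful synthetic variants, which keeps the dataset far smaller than augmenting every example.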

