CV · Nov 27, 2024

Training Data Synthesis with Difficulty Controlled Diffusion Model

arXiv:2411.18109v1 · 2 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses a growing issue in SSL for computer vision as synthetic images become more common, though the contribution is incremental: it adapts existing SSL methods to a new contamination scenario.

The paper examines how synthetic images that contaminate unlabeled data degrade semi-supervised learning (SSL). It proposes RSMatch, which identifies these images and exploits them, turning them from obstacles into resources for improvement.

Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. As a result, synthetic images will inevitably end up in the unlabeled data used for SSL. How this kind of contamination affects SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images on SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and find that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected by them. To address this, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch effectively identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from "obstacles" into "resources." The effectiveness is further verified through ablation studies and visualization.
