CV | AI | LG | Dec 11, 2024

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

arXiv:2412.08221v3 | 3 citations | h-index: 49 | Has Code
Originality: Highly original
AI Analysis

This work addresses the noisy, weakly compositional datasets that hold back researchers and practitioners in visual generation, and it enables better training and evaluation through scalable synthetic data.

The paper tackles compositional generalization and semantic alignment in text-to-vision generation by introducing Generate Any Scene, a data engine that systematically enumerates scene graphs for synthetic data generation. With this approach, the authors report an average 4% gain for Stable Diffusion v1.5 over baselines, a 10% increase in TIFA score from fine-tuning on fewer than 800 synthetic captions, and a +5% margin over CLIP-based methods on DPG-Bench.

Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question-answer pairs that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework in which models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines, surpassing fine-tuning on CC3M. Second, we design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
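
The pipeline the abstract describes (sample a scene graph from a taxonomy, render it as a caption, and derive question-answer pairs for scoring) can be illustrated with a toy sketch. This is not the authors' implementation: the three-item taxonomy, the `SceneGraph` class, and the functions `sample_scene_graph`, `to_caption`, `to_qa`, and `alignment_score` are hypothetical names introduced here for illustration, and `answer_fn` merely stands in for whatever VQA model would answer questions about a generated image.

```python
import random
from dataclasses import dataclass

# Hypothetical toy taxonomy (assumption); the real taxonomy is far larger.
OBJECTS = ["dog", "ball", "table"]
ATTRIBUTES = ["red", "small", "wooden"]
RELATIONS = ["next to", "on top of", "behind"]


@dataclass
class SceneGraph:
    objects: list      # object names; list index is the node id
    attributes: dict   # node id -> list of attribute strings
    relations: list    # (subject id, relation, object id) triples


def sample_scene_graph(n_objects, n_relations, rng):
    """Sample a graph with distinct objects, one attribute each, and random relations."""
    objects = rng.sample(OBJECTS, n_objects)
    attributes = {i: [rng.choice(ATTRIBUTES)] for i in range(n_objects)}
    relations = []
    for _ in range(n_relations):
        s, o = rng.sample(range(n_objects), 2)  # two distinct node ids
        relations.append((s, rng.choice(RELATIONS), o))
    return SceneGraph(objects, attributes, relations)


def phrase(g, i):
    """Render node i as 'attr ... name'."""
    return " ".join(g.attributes[i] + [g.objects[i]])


def to_caption(g):
    """One clause per relation, plus a clause for every object not in any relation."""
    clauses = [f"a {phrase(g, s)} {rel} a {phrase(g, o)}" for s, rel, o in g.relations]
    related = {i for s, _, o in g.relations for i in (s, o)}
    clauses += [f"a {phrase(g, i)}" for i in range(len(g.objects)) if i not in related]
    return ", and ".join(clauses)


def to_qa(g):
    """Derive yes/no questions for every object, attribute, and relation in the graph."""
    qa = []
    for i, name in enumerate(g.objects):
        qa.append((f"Is there a {name}?", "yes"))
        for attr in g.attributes[i]:
            qa.append((f"Is the {name} {attr}?", "yes"))
    for s, rel, o in g.relations:
        qa.append((f"Is the {g.objects[s]} {rel} the {g.objects[o]}?", "yes"))
    return qa


def alignment_score(qa, answer_fn):
    """Fraction of questions that answer_fn (a stand-in VQA model) answers as expected."""
    if not qa:
        return 0.0
    return sum(answer_fn(q) == a for q, a in qa) / len(qa)


if __name__ == "__main__":
    g = sample_scene_graph(n_objects=3, n_relations=2, rng=random.Random(0))
    print(to_caption(g))
    for question, answer in to_qa(g):
        print(question, "->", answer)
```

A score of this kind, computed from a real VQA model's answers on a generated image, is the sort of signal that could serve as a semantic-alignment reward; how the paper actually builds its reward model is not specified in the abstract.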

Code Implementations: 1 repo
