LG · AI · May 10, 2025

Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models

arXiv:2505.08803v1 · 4 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the risk of performance degradation in self-improving multi-agent AI systems and offers practical guidelines for synthetic data curation. It is incremental in that it extends single-modality findings to multi-modal contexts.

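The study investigated model collapse in multi-modal vision-language models and text-to-image diffusion models. It found that collapse shows distinct characteristics in this setting, such as improved vision-language alignment and increased variance in VLM image captioning, and identified mitigation strategies: increased decoding budgets, greater model diversity, and relabeling with frozen models.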

Recent research has highlighted the risk of generative model collapse, where performance progressively degrades when a model is continually trained on self-generated data. However, existing work on model collapse is restricted to single, unimodal models, which limits our understanding of more realistic scenarios, such as diverse multi-modal AI agents that interact autonomously through synthetic data and continually evolve. We expand the study of synthetic data training and model collapse to multi-modal vision-language generative systems, such as vision-language models (VLMs) and text-to-image diffusion models, as well as recursive generate-train loops with multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in the VLM image-captioning task. Additionally, we find that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate model collapse. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems and curating robust multi-modal synthetic datasets.
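The recursive generate-train loop described in the abstract can be illustrated with a deliberately tiny toy, sketched here under strong assumptions: the "model" is a one-dimensional Gaussian rather than a VLM or diffusion model, "training" is a maximum-likelihood fit of mean and standard deviation, and the function and parameter names (`recursive_fit`, `generations`, `sample_size`) are hypothetical and not taken from the paper. It shows only the classic single-modality effect, where diversity shrinks when each generation is fitted solely to the previous generation's samples. It does not reproduce the multi-modal behaviour the paper reports.

```python
import random
import statistics


def recursive_fit(generations, sample_size, seed=0):
    """Toy generate-train loop: fit a Gaussian to its own samples, repeatedly.

    Returns the fitted standard deviation after each generation,
    starting with the initial "real data" value of 1.0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: the assumed "real" distribution
    history = [std]
    for _ in range(generations):
        # Generate: sample synthetic data from the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Train: refit the model on synthetic data only (maximum-likelihood fit).
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = recursive_fit(generations=200, sample_size=10)
    for generation in (0, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.3e}")
```

With a small `sample_size`, the fitted standard deviation drifts toward zero over many generations, because each refit slightly underestimates the spread and the errors compound. Increasing `sample_size` slows this drift in the toy. That is only a loose analogy, not a model of the decoding-budget mitigation studied in the paper.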

