CV · Mar 3, 2025

Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

arXiv:2503.01167v2 · 11 citations · h-index: 27 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the problem of data scarcity for compositional learning in vision-language models, offering an incremental improvement through synthetic data generation.

The paper tackles the challenge of synthesizing multimodal training data for vision-language compositional understanding. It proposes SPARCL, which improves the average accuracy of the CLIP base model by over 8% across four benchmarks and outperforms state-of-the-art methods by 2% on three of them.

Paired image-text data with subtle variations in-between (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing such training data from generative models is a highly coveted prize due to the reduced cost of data collection. However, synthesizing training images for compositional learning presents three challenges: (1) efficiency in generating large quantities of images, (2) text alignment between the generated image and the caption in the exact place of the subtle change, and (3) image fidelity in ensuring sufficient similarity with the original real images in all other places. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model, followed by an image style transfer step, to meet the three challenges. Further, to cope with any residual issues of text alignment, we propose an adaptive margin loss to filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluation on four compositional understanding benchmarks demonstrates that SPARCL significantly improves the compositionality of CLIP, boosting the average accuracy of the CLIP base model by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
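The abstract names an adaptive margin loss that filters out likely-incorrect synthetic samples and concentrates learning on hard ones, but it does not state the formula. The sketch below is therefore only an illustration of the general idea, not the paper's loss: the function name `filtered_margin_loss`, the parameters `base_margin` and `filter_gap`, and the specific margin schedule are all assumptions made for this example. It scores each image against its original caption and a perturbed caption, drops pairs where the image matches the perturbed caption far better (a possible sign of a misaligned synthetic sample), and applies a hinge penalty whose margin is largest when the two captions are hardest to tell apart.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def filtered_margin_loss(img, pos_txt, neg_txt, base_margin=0.2, filter_gap=-0.3):
    """Hypothetical filtered hinge loss with a per-sample margin.

    img, pos_txt, neg_txt: (n, d) embeddings of images, matching captions,
    and perturbed (non-matching) captions. All names and defaults here are
    assumptions for illustration, not values from the paper.
    """
    gap = cosine(img, pos_txt) - cosine(img, neg_txt)

    # Filtering (assumed rule): if the image matches the perturbed caption
    # much better than its own caption, treat the sample as suspect.
    keep = gap >= filter_gap

    # Adaptive margin (assumed schedule): full margin when the captions are
    # indistinguishable (gap near 0), shrinking to 0 as the gap approaches 1.
    margin = base_margin * np.clip(1.0 - gap, 0.0, 1.0)

    # Hinge penalty, zeroed out for filtered samples.
    per_sample = np.where(keep, np.maximum(0.0, margin - gap), 0.0)

    # Average over the kept samples only.
    loss = per_sample.sum() / max(int(keep.sum()), 1)
    return float(loss), keep

# Toy 2-D embeddings: an easy pair, a hard pair, and a suspect pair.
img = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
pos = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
neg = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
loss, keep = filtered_margin_loss(img, pos, neg)
```

In this toy input the first pair is already well separated and contributes nothing, the second pair is a hard sample and receives the full margin, and the third pair is filtered out before averaging.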

