Vision-Language Model Dialog Games for Self-Improvement
This work addresses the problem of data scarcity for vision-language models, offering a scalable self-improvement method with potential applications in real-world scenarios where multimodal data is limited.
The paper tackles the shortage of high-quality training data for vision-language models by introducing VLM Dialog Games, a self-play framework that generates synthetic data through goal-oriented interactions. Fine-tuning on this data yields performance gains on downstream tasks and generalizes across datasets.
The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented game centered on image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalizes across datasets. Moreover, because improvements in the model lead to better game play, the procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios, especially where high-quality multimodal data is scarce.
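The play-then-filter step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two VLM agents are replaced by hypothetical toy functions (`toy_guess` and a noisy answerer), images are stand-in attribute dictionaries, and the names `GameRecord`, `play_game`, `filter_successful`, and `run_round` are assumptions introduced only for illustration. The fine-tuning step and the iteration over rounds are left out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GameRecord:
    """One self-play game: candidate images, hidden target, dialog, final guess."""
    candidates: list
    target_index: int
    dialog: list = field(default_factory=list)
    guess: int = -1

    @property
    def success(self):
        # A game counts as successful only if the guesser picked the target.
        return self.guess == self.target_index


def play_game(candidates, target_index, answer_fn, guess_fn, questions):
    """Run one game: the answerer sees the target, the guesser sees only the dialog."""
    record = GameRecord(candidates, target_index)
    for question in questions:
        answer = answer_fn(candidates[target_index], question)
        record.dialog.append((question, answer))
    record.guess = guess_fn(candidates, record.dialog)
    return record


def filter_successful(records):
    """Keep only games whose goal was reached; these would form the training set."""
    return [r for r in records if r.success]


def toy_guess(candidates, dialog):
    """Hypothetical guesser: first candidate consistent with every answer, else -1."""
    for i, image in enumerate(candidates):
        if all(image[q] == a for q, a in dialog):
            return i
    return -1


def run_round(num_games, rng, error_rate=0.3):
    """Play several games with a toy answerer that sometimes fails to answer."""
    # Stand-in "images": attribute dictionaries instead of pixels.
    images = [{"color": c, "shape": s}
              for c in ("red", "blue") for s in ("cube", "ball")]

    def noisy_answer(image, question):
        # With probability error_rate the answerer gives an unusable reply.
        if rng.random() < error_rate:
            return "unknown"
        return image[question]

    records = []
    for _ in range(num_games):
        target = rng.randrange(len(images))
        records.append(
            play_game(images, target, noisy_answer, toy_guess, ["color", "shape"])
        )
    return records


if __name__ == "__main__":
    games = run_round(50, random.Random(0))
    kept = filter_successful(games)
    print(f"kept {len(kept)} of {len(games)} games")
```

In this sketch the game outcome acts as an automatic quality filter: a dialog is retained only when it contained enough correct information for the guesser to identify the target. In the setting the abstract describes, the retained dialogs would then be used for fine-tuning, and the improved model would play the next round.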