PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci

arXiv:2602.01370v11.5

Originality: Highly original

AI Analysis

This work addresses the scalability and robustness of synthetic data for vision-language pre-training, offering a more data-efficient approach than simply increasing data volume.

The paper tackles limited feature diversity in synthetic vision-language training by introducing PolyGen, a framework that uses multiple distinct generators to improve manifold coverage. Compared with a single-source baseline (SynthCLIP), it reports a +19.0% improvement on aggregate multi-task benchmarks and +9.1% on the SugarCrepe++ compositionality benchmark.

Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.
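
The abstract names two mechanisms without giving details: spending a fixed data budget on multi-source variations rather than unique captions, and generating hard negatives programmatically. The Python sketch below is one possible reading, not the authors' implementation. The generator names, the function names `allocate_budget` and `swap_hard_negative`, and the choice of a word swap as the perturbation are all assumptions made for illustration; the source does not specify any of them.

```python
import random

# Hypothetical generator identifiers; the source does not list the generators used.
GENERATORS = ["generator_a", "generator_b", "generator_c"]


def allocate_budget(captions, budget, generators):
    """Plan a fixed image budget as (caption, generator) jobs.

    Instead of one image for each of `budget` unique captions, keep only
    budget // len(generators) captions and render each with every generator.
    """
    n_captions = budget // len(generators)
    return [(cap, gen) for cap in captions[:n_captions] for gen in generators]


def swap_hard_negative(caption, rng):
    """Build a hard negative by swapping two different words (assumed perturbation).

    The result has the same bag of words as the caption but a different
    word order, so it differs only in syntax. Returns None when the caption
    has no pair of distinct words to swap.
    """
    words = caption.split()
    pairs = [
        (i, j)
        for i in range(len(words))
        for j in range(i + 1, len(words))
        if words[i] != words[j]
    ]
    if not pairs:
        return None
    i, j = rng.choice(pairs)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

With a budget of 12 images and three generators, `allocate_budget` keeps 4 captions and schedules each of them on all three generators, so the total number of images stays at 12. Passing a seeded `random.Random` instance to `swap_hard_negative` keeps the generated negatives reproducible.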
