CV · Aug 1, 2025

ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

arXiv:2508.01008v1 · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better datasets in open-vocabulary text-to-image generation, offering a scalable solution with broad applications in AI-driven content creation, though it is incremental in building upon existing VLM and LLM technologies.

The paper addresses the construction of high-quality, instance-grounded datasets for text-to-image generation. It introduces ROVI, a synthetic dataset built by labeling 1M curated web images with a VLM-LLM re-captioning strategy. According to the authors, ROVI exceeds existing detection datasets in image quality, resolution, and category diversity, and a GLIGEN model trained on it improves on alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality.

We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.
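The three-stage flow described in the abstract (VLM description, LLM category extraction, open-vocabulary detection) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `recaption`, `Instance`, `Annotation`, and the three `toy_*` stand-ins are hypothetical, the models are passed in as plain callables, and both the de-duplication step and the use of the VLM description as the global prompt are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Instance:
    category: str
    box: Box


@dataclass
class Annotation:
    global_prompt: str          # description tied to the instance labels
    instances: List[Instance]   # detections for the extracted categories


def recaption(
    image: Any,
    vlm_describe: Callable[[Any], str],
    llm_extract_categories: Callable[[str], List[str]],
    ovd_detect: Callable[[Any, List[str]], List[Instance]],
) -> Annotation:
    """Hypothetical re-captioning flow: describe, extract categories, detect."""
    # Stage 1: the VLM writes a comprehensive visual description.
    description = vlm_describe(image)

    # Stage 2: the LLM turns the description into a flat category list.
    raw_categories = llm_extract_categories(description)

    # Assumption: normalise and de-duplicate while preserving order.
    seen = set()
    categories: List[str] = []
    for name in raw_categories:
        key = name.strip().lower()
        if key and key not in seen:
            seen.add(key)
            categories.append(key)

    # Stage 3: the open-vocabulary detector looks for those categories.
    instances = ovd_detect(image, categories)

    # Assumption: the description doubles as the global prompt.
    return Annotation(global_prompt=description, instances=instances)


# Toy stand-ins so the sketch runs without any real model.
def toy_vlm(image: Any) -> str:
    return "A brown dog sits on a red sofa next to a small lamp."


def toy_llm(description: str) -> List[str]:
    vocabulary = ["dog", "sofa", "lamp", "cat"]
    return [word for word in vocabulary if word in description.lower()]


def toy_ovd(image: Any, categories: List[str]) -> List[Instance]:
    # Fabricated placeholder boxes, one per requested category.
    return [
        Instance(category=c, box=(0.0, 0.0, 10.0 * (i + 1), 10.0))
        for i, c in enumerate(categories)
    ]


annotation = recaption(None, toy_vlm, toy_llm, toy_ovd)
```

Passing the models in as callables keeps the sketch independent of any particular VLM, LLM, or detector; the actual models and post-processing used for ROVI are documented in the linked repository.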

