CV · Nov 25, 2025

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

arXiv:2511.20573v1 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses open-source visual question-visual answering, in which a model answers a visual question by generating an image. The advance is rated incremental because it builds on existing capabilities.

This paper tackles the problem of generating images in response to visual questions. It introduces VQ-VA World, a data-centric framework that constructs a large-scale dataset of ~1.8M samples. Trained on this data, LightFusion achieves 53.06 on the new IntelligentBench benchmark, substantially outperforming prior open-source baselines and narrowing the gap to proprietary systems.

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.

