CV AI CL LG

May 28, 2025

Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation

Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao

arXiv:2505.21956v33.61 citations h-index: 10 Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of generating images from complex queries when no single retrieved image contains all the desired elements, offering a solution aimed at domain-specific multimodal AI applications.

The paper tackles text-to-image generation that requires fine-grained knowledge. It proposes Cross-modal RAG, which decomposes queries and images into sub-dimensional components for retrieval and generation, and reports that the method outperforms existing baselines in retrieval on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT.

Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy, combining a sub-dimensional sparse retriever with a dense retriever, to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in retrieval and further improves generation quality, while maintaining high efficiency.
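
To make the idea of a Pareto-optimal image set concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate image already has one score per subquery (how those scores are produced by the sparse and dense retrievers is not shown), and it keeps every image that no other image matches or beats on all subqueries at once. The function names, image ids, example query, and score values are all hypothetical.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` on every subquery
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the ids of images that no other image dominates."""
    return [
        img
        for img, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != img)
    ]


# Hypothetical query "a red bird on a snowy branch at sunset",
# assumed to be decomposed into three subqueries:
# [red bird, snowy branch, sunset]. Scores are made up for illustration.
scores = {
    "img_a": [1.0, 0.0, 0.0],  # dominated by img_d
    "img_b": [0.0, 1.0, 1.0],
    "img_c": [0.0, 1.0, 0.0],  # dominated by img_b and img_d
    "img_d": [1.0, 1.0, 0.0],
}

front = pareto_front(scores)
print(front)
```

In this toy example no single image covers all three subqueries, but the two surviving images together cover every one of them, which is the situation the abstract describes as images contributing complementary aspects of the query.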
