CV, AI, CL, LG | May 28, 2025

Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation

arXiv:2505.21956v3 | 1 citation | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of generating images from complex queries where no single retrieved image contains all desired elements, offering a domain-specific solution for multimodal AI applications.

The paper tackles the problem of text-to-image generation requiring fine-grained knowledge by proposing Cross-modal RAG, which decomposes queries and images into sub-dimensional components for retrieval and generation, achieving significant performance improvements over baselines on multiple datasets.

Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in retrieval and further improves generation quality, while maintaining high efficiency.
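
The idea of a Pareto-optimal set of images can be illustrated with a small sketch. This is not the paper's retriever: it assumes a hypothetical score matrix in which `scores[i][j]` is how well image `i` satisfies subquery `j` (higher is better), and it applies the textbook definition of Pareto dominance. The function names `dominates` and `pareto_front` and the toy scores are made up for illustration.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every subquery
    and strictly better on at least one (standard Pareto dominance)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return indices of images whose score vectors no other image dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for k, t in enumerate(scores) if k != i)
    ]


# Hypothetical per-subquery scores: 3 candidate images, 3 subqueries.
scores = [
    [1.0, 0.0, 0.0],  # image 0 covers only subquery 0
    [0.0, 1.0, 1.0],  # image 1 covers subqueries 1 and 2
    [0.0, 1.0, 0.0],  # image 2 covers only subquery 1, so image 1 dominates it
]

front = pareto_front(scores)
print(front)  # [0, 1]
```

In this toy case no single image satisfies all three subqueries, yet images 0 and 1 together cover the full query, which is the situation the abstract describes. How the paper actually scores subqueries and combines the sparse and dense retrievers is not specified here.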

Code Implementations: 1 repo
