Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation
2025 · Mengdan Zhu, Senhao Cheng, Guangji Bai, et al.
Abstract
Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy, combining a sub-dimensional sparse retriever with a dense retriever, to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis.
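To make the idea of a Pareto-optimal image set concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `pareto_front`, the image ids, and the per-subquery scores are all hypothetical, and the abstract does not say how the scores are computed. The sketch assumes each candidate image already has one score per subquery and keeps every image that no other image matches or beats on all subqueries at once.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` on every subquery and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return ids of images that are not dominated by any other image."""
    return [
        img_id
        for img_id, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != img_id)
    ]


# Hypothetical scores: one entry per subquery (e.g. object, attribute, style).
scores = {
    "img_a": [1.0, 0.0, 1.0],  # covers subqueries 1 and 3
    "img_b": [0.0, 1.0, 0.0],  # covers subquery 2 only
    "img_c": [1.0, 0.0, 0.0],  # dominated by img_a
    "img_d": [0.0, 0.0, 0.0],  # dominated by every other image
}

front = pareto_front(scores)
print(front)
```

In this toy example the retained images are `img_a` and `img_b`: neither satisfies the whole query alone, but together they cover all three subqueries, which is the complementary behaviour the abstract describes.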
Related papers
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- UniversalRAG: Retrieval-augmented Generation Over Corpora Of Diverse Modalities And Granularities (2025)
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025)
- MuRAG: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022)
- ImageRAG: Dynamic Image Retrieval For Reference-guided Image Generation (2025)
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)
- MG²-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026)
- AR-RAG: Autoregressive Retrieval Augmentation For Image Generation (2025)