MV-RAG: Retrieval Augmented Multiview Diffusion
2025 · Yosef Dayani, Omer Benishu, Sagie Benaim
Abstract
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other v
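The held-out view prediction objective mentioned in the abstract can be illustrated with a minimal sketch of how one training example might be assembled from a set of retrieved images. The function name `split_held_out`, the uniform random choice of the held-out index, and the string placeholders standing in for images are all assumptions made for illustration; they are not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_held_out(views: Sequence[T], rng: random.Random) -> Tuple[List[T], T]:
    """Split a set of retrieved views into conditioning views and one target.

    Hypothetical helper: one view is held out as the prediction target and
    the remaining views are used as conditioning inputs.
    """
    if len(views) < 2:
        raise ValueError("need at least two views: one target and one conditioning view")
    # Assumption: the held-out view is chosen uniformly at random.
    target_idx = rng.randrange(len(views))
    conditioning = [v for i, v in enumerate(views) if i != target_idx]
    return conditioning, views[target_idx]


# Placeholder identifiers stand in for retrieved 2D images of one concept.
retrieved = ["img_a", "img_b", "img_c", "img_d"]
conditioning, target = split_held_out(retrieved, random.Random(0))
```

In a training loop under these assumptions, `conditioning` would be fed to the model as its retrieval conditioning and `target` would serve as the view to be predicted.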
Related papers
- ImageRAG: Dynamic Image Retrieval For Reference-guided Image Generation (2025) 0.00
- Eliminating Hallucination In Diffusion-augmented Interactive Text-to-image Retrieval (2026) 0.00
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025) 0.00
- Text-guided Synthesis Of Artistic Images With Retrieval-augmented Diffusion Models (2022) 8.29
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025) 0.00
- Diff-SBSR: Learning Multimodal Feature-enhanced Diffusion Models For Zero-shot Sketch-based 3D Shape Retrieval (2026) 0.00
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025) 6.34
- MG\(^2\)-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026) 0.00