Retrieval-augmented Multimodal Language Modeling
2022 · Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, et al.
Abstract
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption gen
Related papers
- MuRAG: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022)
- LamRA: Large Multimodal Model As Your Advanced Retrieval Assistant (2024)
- Generative Cross-modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond (2024)
- Recurrence Meets Transformers For Universal Multimodal Retrieval (2025)
- TIGER: Unifying Text-to-image Generation And Retrieval With Large Multimodal Models (2024)
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)
- Towards Retrieval-augmented Architectures For Image Captioning (2024)
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)