MARVEL: Unlocking The Multi-modal Capability Of Dense Retrieval Via Visual Module Plugin
2023 · Tianshuo Zhou, Sen Mei, Xinze Li, et al.
Abstract
This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of the well-trained dense retriever, T5-ANCE, by incorporating the visual module's encoded image features as its inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. MARVEL provides an opportunity to broaden the advantages of text retrieval to the multi-modal scenario. Besides, we also
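The plugin idea in the abstract (image features from a visual module fed to a text retriever as extra inputs, with queries and documents sharing one encoder) can be illustrated with a toy sketch. This is not the authors' implementation: the linear projection, the choice to prepend visual tokens, the mean-pooling stand-in for T5-ANCE, the dimensions, and every function name below are assumptions made only for illustration.

```python
import math
import random


def linear(x, weight):
    """Project a vector x (length d_in) with a d_in x d_out weight matrix."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weight))
        for j in range(len(weight[0]))
    ]


def build_input_sequence(image_patch_feats, token_embs, proj_weight):
    """Map image patch features into the text embedding space and
    prepend them to the token embeddings (ordering is an assumption)."""
    visual_tokens = [linear(p, proj_weight) for p in image_patch_feats]
    return visual_tokens + token_embs


def encode(sequence):
    """Hypothetical stand-in for the shared text encoder:
    mean-pool the sequence, then L2-normalise."""
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def score(query_vec, doc_vec):
    """Dot-product relevance between query and document embeddings."""
    return sum(a * b for a, b in zip(query_vec, doc_vec))


# Toy data; all sizes are arbitrary illustration values.
rng = random.Random(0)
VIS_DIM, TXT_DIM = 6, 4


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


proj = [rand_vec(TXT_DIM) for _ in range(VIS_DIM)]   # visual-to-text projection
patches = [rand_vec(VIS_DIM) for _ in range(3)]      # 3 image patch features
caption = [rand_vec(TXT_DIM) for _ in range(5)]      # 5 caption token embeddings
query = [rand_vec(TXT_DIM) for _ in range(2)]        # 2 query token embeddings

doc_seq = build_input_sequence(patches, caption, proj)
doc_vec = encode(doc_seq)      # multi-modal document embedding
query_vec = encode(query)      # text-only query uses the same encoder
relevance = score(query_vec, doc_vec)
```

Because the query and the image-plus-text document pass through the same `encode` function, both land in one embedding space, which is the property the abstract credits with reducing the modality gap.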
Related papers
- MARVEL: Multimodal Adaptive Reasoning-intensive Expand-rerank And Retrieval (2026) 0.00
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) 3.45
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) 0.00
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With MLLMs (2025) 3.11
- REVEAL: Retrieval-augmented Visual-language Pre-training With Multi-source Multimodal Knowledge Memory (2022) 13.65
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024) 16.73
- MuMUR: Multilingual Multimodal Universal Retrieval (2022) 2.26
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026) 0.00