Mm-embed: Universal Multimodal Retrieval With Multimodal Llms
2024 Β· Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, et al.
Abstract
State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the mo
Authors
(none)
Tags
Stats
Related papers
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00
- RETLLM: Training And Data-free Mllms For Multimodal Information Retrieval (2026)1.57
- Lamra: Large Multimodal Model As Your Advanced Retrieval Assistant (2024)7.50
- GME: Improving Universal Multimodal Retrieval By Multimodal Llms (2024)0.00
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With Mllms (2025)3.11
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)4.52
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)0.00
- Generative Giants, Retrieval Weaklings: Why Do Multimodal Large Language Models Fail At Multimodal Retrieval? (2025)0.00