MIRe: Enhancing Multimodal Queries Representation Via Fusion-free Modality Interaction For Multimodal Retrieval
2024 · Yeong-Joon Ju, Ho-Joong Kim, Seong-Whan Lee
Abstract
Recent multimodal retrieval methods have endowed text-based retrievers with multimodal capabilities by utilizing pre-training strategies for visual-text alignment. They often directly fuse the two modalities for cross-reference during the alignment to understand multimodal queries. However, existing methods often overlook crucial visual information because of a text-dominant issue: an over-reliance on text-driven signals. In this paper, we introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations. Additionally, we construct a pre-training dataset for multimodal query retrieval by transforming concise question-answer pairs into extended passages. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries, resulting in
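The one-directional interaction described in the abstract, where the textual query attends to visual embeddings but no text-driven signal is written back into the visual representations, can be illustrated with a minimal sketch. This is not the authors' architecture: it omits learned projections, multiple heads, and training, and the names `one_way_cross_attention`, `text_emb`, and `visual_emb` are hypothetical. It only shows the direction of information flow, assuming both modalities already share an embedding dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_way_cross_attention(text_emb, visual_emb):
    """Text tokens act as queries; visual tokens act as keys and values.

    Illustrative assumption: no learned projection matrices are used.
    visual_emb is only read, never updated, so no text-driven signal
    flows back into the visual representations.
    """
    d = text_emb.shape[-1]
    scores = text_emb @ visual_emb.T / np.sqrt(d)  # (n_text, n_visual)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    attended = weights @ visual_emb                # (n_text, d)
    return attended, weights

# Toy inputs: 4 text tokens and 6 visual tokens in an 8-dimensional space.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
visual_emb = rng.normal(size=(6, 8))
visual_before = visual_emb.copy()

attended, weights = one_way_cross_attention(text_emb, visual_emb)
```

In a fused design, by contrast, the visual tokens would also be updated from the text tokens; here `visual_emb` is identical before and after the call.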
Related papers
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)
- MUST: An Effective And Scalable Framework For Multimodal Search Of Target Modality (2023)
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- Revisiting Cross Modal Retrieval (2018)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025)