Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up
2025 · Lang Huang, Qiyu Wu, Zhongtao Miao, et al.
Abstract
Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions that enhance context interpretation. Through a two-stage training process (post-training adaptation followed by instruction tuning), we adapt multimodal large language models (MLLMs) as retrievers using a simple one-tower architecture. Our approach outperforms conventional methods across diverse retrieval scenarios, particularly when processing complex multi-mo
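The contrast the abstract draws between late-fusion two-tower encoders and an early-fusion one-tower encoder can be illustrated with a toy sketch. This is not the paper's model: the attention layer below is a single untrained head with identity projections, and every function name is hypothetical. It only shows that in a two-tower setup the image representation never sees the text, while in a one-tower setup image tokens attend to text tokens inside the encoder.

```python
import math

def self_attention(tokens):
    """Toy single-head self-attention with identity projections (untrained)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[i] for w, v in zip(weights, tokens))
            for i in range(len(q))
        ])
    return out

def mean_pool(tokens):
    """Average equal-length token vectors into a single embedding."""
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(len(tokens[0]))]

def two_tower_image_states(image_tokens, text_tokens):
    """Late fusion: the image tower encodes image tokens alone (text is ignored)."""
    return self_attention(image_tokens)

def one_tower_image_states(image_tokens, text_tokens):
    """Early fusion: one encoder attends over the concatenated sequence."""
    joint = self_attention(image_tokens + text_tokens)
    return joint[:len(image_tokens)]

def two_tower_embed(image_tokens, text_tokens):
    """Encode each modality separately, then merge the pooled vectors at the end."""
    img_vec = mean_pool(self_attention(image_tokens))
    txt_vec = mean_pool(self_attention(text_tokens))
    return [(a + b) / 2 for a, b in zip(img_vec, txt_vec)]

def one_tower_embed(image_tokens, text_tokens):
    """Pool the jointly encoded image and text sequence into one embedding."""
    return mean_pool(self_attention(image_tokens + text_tokens))

# Hypothetical inputs: two image "patch" tokens and two alternative text queries.
image = [[1.0, 0.0], [0.0, 1.0]]
text_a = [[1.0, 1.0]]
text_b = [[-1.0, 2.0]]

late_same = two_tower_image_states(image, text_a) == two_tower_image_states(image, text_b)
early_same = one_tower_image_states(image, text_a) == one_tower_image_states(image, text_b)
print("two-tower image states unchanged by text:", late_same)
print("one-tower image states unchanged by text:", early_same)
```

In the two-tower functions the only cross-modal step is the final average, whereas the one-tower functions mix modalities inside the attention layer itself, which is the kind of early interaction the abstract refers to.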
Related papers
- Mire: Enhancing Multimodal Queries Representation Via Fusion-free Modality Interaction For Multimodal Retrieval (2024) 3.81
- MUST: An Effective And Scalable Framework For Multimodal Search Of Target Modality (2023) 7.81
- Revisiting Cross Modal Retrieval (2018) 0.00
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025) 0.00
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) 0.00
- The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment (2025) 0.00