BRIDGE: Multimodal-to-text Retrieval Via Reinforcement-learned Query Alignment
2026 · Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, et al.
Abstract
Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces.
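For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It assumes linear gain and a log2 rank discount. The abstract does not state the exact formulation used for MM-BRIGHT, so the function names (`dcg_at_k`, `ndcg_at_k`) and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevance: dict mapping doc id -> graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example (hypothetical ids): the only relevant document is returned at rank 2.
score = ndcg_at_k(["d2", "d1", "d3"], {"d1": 1})
```

Benchmarks typically average this per-query value over all queries and report it on a 0-100 scale, which is how a figure such as 27.6 should be read.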
Related papers
- MARVEL: Multimodal Adaptive Reasoning-intensive Expand-rerank And Retrieval (2026) 0.00
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025) 0.00
- MIRe: Enhancing Multimodal Queries Representation Via Fusion-free Modality Interaction For Multimodal Retrieval (2024) 3.81
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025) 0.00
- Bridging Video-text Retrieval With Multiple Choice Questions (2022) 15.37
- LexSemBridge: Fine-grained Dense Representation Enhancement Through Token-aware Embedding Augmentation (2025) 2.35
- Attention Grounded Enhancement For Visual Document Retrieval (2025) 0.00
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026) 0.00