BRIDGE: Multimodal-to-text Retrieval Via Reinforcement-learned Query Alignment

Abstract

Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces.
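
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It assumes linear gain and a log2 rank discount, which is one common convention; the abstract does not state which nDCG variant MM-BRIGHT uses, nor does it confirm that the reported 27.6 is this value scaled by 100. The function names `dcg_at_k` and `ndcg_at_k` and the toy document ids are illustrative only.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k gains (linear gain, log2 discount)."""
    # The 0-based position `rank` corresponds to 1-based rank `rank + 1`,
    # so the discount is log2((rank + 1) + 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; `relevance` maps doc id to graded relevance (assumed format)."""
    # Documents that have no relevance judgment count as gain 0.
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists judged documents in decreasing order of relevance.
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example: the only relevant document is retrieved at rank 2.
score = ndcg_at_k(["doc_x", "doc_a", "doc_y"], {"doc_a": 1.0})
print(round(score, 4))
```

A benchmark-level score would then be the mean of this per-query value over all queries, typically reported as a percentage.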
