Abstract

Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces.
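
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It assumes the linear-gain form of DCG and relevance labels given in ranked order; the function names are illustrative and not taken from the paper, and the benchmark's official evaluation script may differ in details. Scores such as 27.6 appear to be reported on a 0-100 scale, which would correspond to 0.276 here.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked_relevances: relevance labels of retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query; if omitted,
        the ideal ranking is built from the retrieved labels only (an assumption
        that overestimates the score when relevant documents were missed).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One relevant document, retrieved at rank 2.
    print(ndcg_at_k([0, 1, 0]))
```

A benchmark-level figure is then the mean of this per-query value over all queries.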

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations: 0
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: mounis2026bridge

Related papers