MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
2025 Β· Saron Samuel, Dan Degenaro, Jimena Guallar-Blasco, et al.
Abstract
Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retri
Authors
(none)
Tags
Stats
Related papers
- Clamr: Contextualized Late-interaction For Multimodal Content Retrieval (2025)0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)0.00
- Mumur : Multilingual Multimodal Universal Retrieval (2022)2.26
- Magmar Shared Task System Description: Video Retrieval With Omniembed (2025)0.00
- MERLIN: Multimodal Embedding Refinement Via Llm-based Iterative Navigation For Text-video Retrieval-rerank Pipeline (2024)5.84
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00
- MARVEL: Multimodal Adaptive Reasoning-intensive Expand-rerank And Retrieval (2026)0.00
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022)0.00