CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
2025 · David Wan, Han Wang, Elias Stengel-Eskin, et al.
Abstract
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic trai
Related papers
- Multimodal Contextualized Support For Enhancing Video Retrieval System (2026)
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- Embedding-based Retrieval In Multimodal Content Moderation (2025)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Enhanced Multimodal Video Retrieval System: Integrating Query Expansion And Cross-modal Temporal Event Retrieval (2025)
- Composed Multi-modal Retrieval: A Survey Of Approaches And Applications (2025)
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025)