VeRVE: Versatile Retrieval for Videos via Unified Embeddings
2026 · Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, et al.
Abstract
Modern video retrieval systems are expected to handle diverse tasks, ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus-level and moment-level retrieval while accommodating composed multimodal queries within a single architecture. We contrastively align visual and textual embeddings generated by a shared MLLM backbone to enable efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data sample