MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence
2025 · Yue Feng, Jinwei Hu, Qijia Lu, et al.
Abstract
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensiv
Related papers
- MuMUR: Multilingual Multimodal Universal Retrieval (2022) 2.26
- MomentSeeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025) 0.00
- LoVR: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025) 0.00
- MultiVENT 2.0: A Massive Multilingual Benchmark For Event-centric Video Retrieval (2024) 3.58
- Multimodal Lengthy Videos Retrieval Framework And Evaluation Metric (2025) 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Multi-query Video Retrieval (2022) 9.59
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025) 2.26