Multimodal Lengthy Videos Retrieval Framework And Evaluation Metric
2025 · Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, et al.
Abstract
Precise video retrieval requires multimodal correlations to handle unseen vocabulary and scenes. The task becomes harder for lengthy videos, where models must perform well without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitle-based video segmentation approach. The aural stream also includes a complementary audio-based two-stage retrieval mechanism that improves performance on long-duration videos. Because both retrieval from lengthy videos and its evaluation are complex, we introduce a new evaluation method designed specifically for long-video retrieval to support further research. Experiments on the YouCook2 benchmark show promising retrieval performance.
Related papers
- Lovr: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025)
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025)
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)
- Multivent 2.0: A Massive Multilingual Benchmark For Event-centric Video Retrieval (2024)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Enhanced Multimodal Video Retrieval System: Integrating Query Expansion And Cross-modal Temporal Event Retrieval (2025)
- Multimodal Contextualized Support For Enhancing Video Retrieval System (2026)
- A Multimodal Deep Learning Framework For Scalable Content Based Visual Media Retrieval (2021)