Unified Interactive Multimodal Moment Retrieval Via Cascaded Embedding-reranking And Temporal-aware Score Fusion
2025 · Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, et al.
Abstract
The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies break down under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, which reduces usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific
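The temporal-aware scoring described in the abstract can be illustrated with a minimal sketch. It assumes each event in a query has a list of candidate frames given as (timestamp in seconds, similarity score) pairs, and that the gap penalty takes the form exp(-decay * gap). The function name `temporal_beam_search`, the parameters `beam_width` and `decay`, and the exact penalty form are all hypothetical; the abstract does not give the authors' formula.

```python
import math


def temporal_beam_search(candidates_per_event, beam_width=3, decay=0.1):
    """Pick one frame per event so that the sequence is ordered in time.

    candidates_per_event: one list per event (in query order), each holding
    (timestamp_seconds, score) pairs. Hypothetical input format.
    Returns up to beam_width (sequence, total_score) pairs, best first.
    """
    beams = [([], 0.0)]  # each beam: (chosen frames so far, cumulative score)
    for candidates in candidates_per_event:
        expanded = []
        for seq, total in beams:
            for t, score in candidates:
                if seq:
                    gap = t - seq[-1][0]
                    if gap <= 0:
                        continue  # keep frames strictly ordered in time
                    # Assumed penalty: larger gaps shrink the frame's contribution.
                    weight = math.exp(-decay * gap)
                else:
                    weight = 1.0  # first event has no predecessor, so no penalty
                expanded.append((seq + [(t, score)], total + score * weight))
        # Keep only the highest-scoring partial sequences.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


# Two events; the second has one nearby candidate and one far-away candidate.
candidates = [
    [(10.0, 0.9), (50.0, 0.8)],
    [(12.0, 0.7), (300.0, 0.95)],
]
best_sequence, best_score = temporal_beam_search(candidates)[0]
```

In this toy input the frame at 300 s has the highest raw score for the second event, but the decay makes its contribution negligible, so the sequence at 10 s and 12 s ranks first. That is the intended effect of penalizing unrealistic gaps: a coherent pair of nearby frames beats a strong but isolated match.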
Related papers
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)
- A Lightweight Moment Retrieval System With Global Re-ranking And Robust Adaptive Bidirectional Temporal Search (2025)
- Enhanced Multimodal Video Retrieval System: Integrating Query Expansion And Cross-modal Temporal Event Retrieval (2025)
- Frame-wise Cross-modal Matching For Video Moment Retrieval (2020)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Multimodal Contextualized Support For Enhancing Video Retrieval System (2026)
- Embedding-based Retrieval In Multimodal Content Moderation (2025)
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025)