Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking
2025 · Huu-Loc Tran, Tinh-Anh Nguyen-Nhu, Huu-Phong Phan-Nguyen, et al.
Abstract
Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonst
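To make innovations (1) and (4) concrete, here is a minimal sketch of how scores from a coarse-grained and a fine-grained model might be fused, and how a frame's score might be stabilized using its temporal neighbors. Everything in it is an assumption for illustration: the abstract does not specify the fusion rule, the window size, or the smoothing function, so `min_max`, `fuse`, `temporal_rerank`, `rank`, the `weight` and `radius` parameters, and the example scores are all hypothetical and should not be read as the authors' implementation.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(coarse: List[float], fine: List[float], weight: float = 0.5) -> List[float]:
    """Weighted sum of normalized per-frame scores (assumed fusion rule).

    `coarse` and `fine` stand in for query-to-frame similarities from a
    coarse-grained and a fine-grained model; `weight` is a hypothetical knob.
    """
    c, f = min_max(coarse), min_max(fine)
    return [weight * a + (1.0 - weight) * b for a, b in zip(c, f)]


def temporal_rerank(scores: List[float], radius: int = 1) -> List[float]:
    """Replace each frame score with the mean over its temporal window.

    Frames are assumed to be in temporal order within one video. Averaging
    is an assumed stand-in for "leveraging neighboring frame context".
    """
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def rank(scores: List[float]) -> List[int]:
    """Frame indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up scores: an isolated spike at frame 2, a sustained match at frames 5-7.
raw = [0.1, 0.1, 0.9, 0.1, 0.1, 0.6, 0.7, 0.6, 0.1]
top_before = rank(raw)[0]                   # the isolated spike wins
top_after = rank(temporal_rerank(raw))[0]   # the sustained region wins
```

In this toy input the raw ranking is led by a single outlier frame, while the smoothed ranking is led by the middle of the region where several consecutive frames agree. That is one plausible reading of how neighboring-frame context could stabilize rankings.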