Frame-wise Cross-modal Matching For Video Moment Retrieval
2020 · Haoyu Tang, Jihua Zhu, Meng Liu, et al.
Abstract
Video moment retrieval aims to retrieve a moment in a video for a given language query. The challenges of this task include 1) localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and the video contents. To tackle these problems, early approaches adopt sliding windows or uniform sampling to collect video clips first and then match each clip with the query. These strategies are time-consuming and often yield unsatisfactory localization accuracy because the length of the ground-truth moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without first generating video clips. One mainstream approach is to generate a multimodal feature vector for the target query and video frames (e.g., by concatenation) and then apply regression to that multimodal feature vector for boundary detection. Although some progress h
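To make the clip-then-match baseline described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' method and not any specific published system: the function names (`sliding_windows`, `retrieve_moment`, `temporal_iou`), the mean pooling of frame features, and the cosine-similarity matching are all assumptions chosen for illustration, and the features are toy 2-D vectors rather than real video or text embeddings.

```python
import math


def sliding_windows(num_frames, window_sizes, stride_ratio=0.5):
    """Enumerate candidate (start, end) frame spans; end is exclusive."""
    spans = []
    for w in window_sizes:
        if w > num_frames:
            continue  # a window longer than the video yields no candidates
        stride = max(1, int(w * stride_ratio))
        for start in range(0, num_frames - w + 1, stride):
            spans.append((start, start + w))
    return spans


def mean_pool(frame_feats, start, end):
    """Average the per-frame feature vectors inside [start, end)."""
    dim = len(frame_feats[0])
    n = end - start
    return [sum(f[d] for f in frame_feats[start:end]) / n for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_moment(frame_feats, query_feat, window_sizes):
    """Score every candidate clip against the query and keep the best one.

    Ties are resolved in favour of the first candidate enumerated.
    Returns (score, start, end), or None if no window fits the video.
    """
    best = None
    for start, end in sliding_windows(len(frame_feats), window_sizes):
        score = cosine(mean_pool(frame_feats, start, end), query_feat)
        if best is None or score > best[0]:
            best = (score, start, end)
    return best


def temporal_iou(a, b):
    """Intersection over union of two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


# Toy video: frames 3..5 match the query direction, the rest do not.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 2
query = [0.0, 1.0]
score, start, end = retrieve_moment(frames, query, window_sizes=[2])
print((start, end), temporal_iou((start, end), (3, 6)))
```

In this toy setup the relevant moment spans three frames, but only two-frame windows are enumerated, so even the best candidate cannot fully overlap it. This illustrates the limitation the abstract points to: the number of candidates grows with every window size added, while any fixed set of sizes can still miss the true moment length.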
Related papers
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024)
- VLANet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020)
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023)
- Viseret: A Simple Yet Effective Approach To Moment Retrieval Via Fine-grained Video Segmentation (2021)
- Semantic Video Moments Retrieval At Scale: A New Task And A Baseline (2022)
- Context-enhanced Video Moment Retrieval With Large Language Models (2024)
- When One Moment Isn't Enough: Multi-moment Retrieval With Cross-moment Interactions (2025)
- Unified Interactive Multimodal Moment Retrieval Via Cascaded Embedding-reranking And Temporal-aware Score Fusion (2025)