Hybrid-learning Video Moment Retrieval Across Multi-domain Labels
2024 · Weitong Cai, Jiabo Huang, Shaogang Gong
Abstract
Video moment retrieval (VMR) is to search for a visual temporal moment in an untrimmed raw video by a given text query description (sentence). Existing studies either start from collecting exhaustive frame-wise annotations on the temporal boundary of target moments (fully-supervised), or learn with only the video-level video-text pairing labels (weakly-supervised). The former is poor in generalisation to unknown concepts and/or novel scenes due to restricted dataset scale and diversity under expensive annotation costs; the latter is subject to visual-textual mis-correlations from incomplete labels. In this work, we introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer through adapting the video-text matching relationships learned from a fully-supervised source domain to a weakly-labelled target domain when they do not share a common label space. Our aim is to explore shared universal knowledge between the two domains in order
Related papers
- Vlanet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020) [13.28]
- Frame-wise Cross-modal Matching For Video Moment Retrieval (2020) [13.17]
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) [5.84]
- Audio Does Matter: Importance-aware Multi-granularity Fusion For Video Moment Retrieval (2025) [4.49]
- MVMR: A New Framework For Evaluating Faithfulness Of Video Moment Retrieval Against Multiple Distractors (2023) [1.40]
- Viseret: A Simple Yet Effective Approach To Moment Retrieval Via Fine-grained Video Segmentation (2021) [0.00]
- When One Moment Isn't Enough: Multi-moment Retrieval With Cross-moment Interactions (2025) [1.81]
- Video Moment Retrieval With Text Query Considering Many-to-many Correspondence Using Potentially Relevant Pair (2021) [0.00]