VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
2020 · Minuk Ma, Sunjae Yoon, Junyeong Kim, et al.
Abstract
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video that is specified by a natural language query. Several VMR methods that require full supervision for training have been proposed. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is a labor-intensive process. This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training uses no temporal moment labels, only the text query that describes a segment of the video. Existing wVMR methods generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for correct video-query pairs than for incorrect pairs. It has been observed that a large number of candidate proposals, a coarse query representation, and a one-way attention mechanism lead to blurry atte
Related papers
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024) 0.00
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) 5.84
- Frame-wise Cross-modal Matching For Video Moment Retrieval (2020) 13.17
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023) 14.33
- Granalign: Granularity-aware Alignment Framework For Zero-shot Video Moment Retrieval (2026) 0.00
- Logan: Latent Graph Co-attention Network For Weakly-supervised Video Moment Retrieval (2019) 13.05
- Selective Query-guided Debiasing For Video Corpus Moment Retrieval (2022) 9.59
- Video Corpus Moment Retrieval With Contrastive Learning (2021) 14.35