LoGAN: Latent Graph Co-attention Network For Weakly-supervised Video Moment Retrieval
2019 · Reuben Tan, Huijuan Xu, Kate Saenko, et al.
Abstract
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to the given natural language query without access to temporal annotations during training. Prior strongly- and weakly-supervised approaches often leverage co-attention mechanisms to learn visual-semantic representations for localization. However, while such approaches tend to focus on identifying relationships between elements of the video and language modalities, there is less emphasis on modeling relational context between video frames given the semantic context of the query. Consequently, the above-mentioned visual-semantic representations, built upon local frame features, do not contain much contextual information. To address this limitation, we propose a Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to reason about correspondences between all possible pairs of frames, given the semantic context of the query. Comprehensive experiment
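The two ideas named in the abstract, frame-by-word interaction followed by reasoning over all pairs of frames, can be illustrated with a minimal sketch. This is not the authors' architecture: the dot-product similarity, the additive fusion, the residual update, and every function name below are assumptions chosen only to keep the example small and dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Weighted average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def word_conditioned_frames(frames, words):
    """Frame-by-word step: each frame attends over all query words.

    Assumption: dot-product similarity and additive fusion stand in for
    whatever learned co-attention the paper actually uses.
    """
    fused = []
    for frame in frames:
        attn = softmax([dot(frame, word) for word in words])
        query_context = weighted_sum(attn, words)
        fused.append([f + c for f, c in zip(frame, query_context)])
    return fused

def frame_graph_step(nodes):
    """One round of message passing over a fully connected frame graph.

    Every frame exchanges information with every other frame, so each
    output vector carries context from the whole video.
    Assumption: attention-weighted messages with a residual update.
    """
    updated = []
    for node in nodes:
        attn = softmax([dot(node, other) for other in nodes])
        message = weighted_sum(attn, nodes)
        updated.append([n + m for n, m in zip(node, message)])
    return updated

# Toy inputs: 3 frame features and 2 word features, all 2-dimensional.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]

fused = word_conditioned_frames(frames, words)
contextual = frame_graph_step(fused)
```

In this sketch `contextual` holds one vector per frame, each conditioned on the query and on all other frames. A real model would replace the fixed dot products with learned projections and would score candidate segments on top of these representations.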
Related papers
- VLANet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020) · 13.28
- Frame-wise Cross-modal Matching For Video Moment Retrieval (2020) · 13.17
- Text-based Localization Of Moments In A Video Corpus (2020) · 10.35
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) · 5.84
- Video Moment Retrieval With Text Query Considering Many-to-many Correspondence Using Potentially Relevant Pair (2021) · 0.00
- Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning (2024) · 3.58
- Disentangle And Denoise: Tackling Context Misalignment For Video Moment Retrieval (2024) · 0.00
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024) · 0.00