LocVTP: Video-Text Pre-training for Temporal Localization
2022 · Meng Cao, Tianyu Yang, Junwu Weng, et al.
Abstract
Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, is under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one via a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned features, we propose a context projection head and a temporal-aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate tha
Related papers
- Temporal Perceiving Video-Language Pre-training (2023)
- TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision (2022)
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval (2023)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (2021)
- Video-Text Pre-training with Learned Regions (2021)
- LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval (2022)
- VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval (2022)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling (2022)