Revitalize Region Feature For Democratizing Video-language Pre-training Of Retrieval
2022 · Guanyu Cai, Yixiao Ge, Binjie Zhang, et al.
Abstract
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research has become extremely expensive because of its need for massive data and long training time, which hinders further exploration. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while still achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnections between pre-extracted region features and text. Extensive results of downstream video-language retrieval task
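The bidirectional region-word alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function names `cosine` and `bidirectional_alignment`, the use of cosine similarity, and the max-then-mean aggregation are all assumptions chosen only to show what "bidirectional" could mean, namely that every region is matched to its best word and every word is matched to its best region.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def bidirectional_alignment(regions, words):
    """Hypothetical alignment score (an assumption, not the authors' loss).

    regions: list of region feature vectors (e.g. pre-extracted detector outputs)
    words:   list of word embedding vectors of the same dimension
    Returns (region_to_word, word_to_region, combined).
    """
    # Pairwise similarity matrix: one row per region, one column per word.
    sim = [[cosine(r, w) for w in words] for r in regions]
    # Region -> word: each region takes its best-matching word, then average.
    region_to_word = sum(max(row) for row in sim) / len(regions)
    # Word -> region: each word takes its best-matching region, then average.
    word_to_region = sum(
        max(sim[i][j] for i in range(len(regions))) for j in range(len(words))
    ) / len(words)
    return region_to_word, word_to_region, 0.5 * (region_to_word + word_to_region)
```

In this sketch the two directions can differ: a region with no matching word lowers only the region-to-word term, which is one reason a two-sided score may be preferable to a one-sided one.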
Related papers
- Video-text Pre-training With Learned Regions (2021)
- LiteVL: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022)
- Temporal Perceiving Video-language Pre-training (2023)
- LocVTP: Video-text Pre-training For Temporal Localization (2022)
- RaP: Redundancy-aware Video-language Pre-training For Text-video Retrieval (2022)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Masked Contrastive Pre-training For Efficient Video-text Retrieval (2022)
- DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025)