Masked Contrastive Pre-training For Efficient Video-text Retrieval
2022 · Fangxun Shu, Biaolong Chen, Yue Liao, et al.
Abstract
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval tasks. MAC aims to reduce the spatial and temporal redundancy of video representations in VidLP models through a mask sampling mechanism, thereby improving pre-training efficiency. Compared with conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder, which acts as sparse spatial sampling. For consistency, we apply the same mask sampling technique to text inputs. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from
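The masked-then-alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mask ratio of 0.75, the mean-pooling stand-in for the encoder, the temperature of 0.07, and all function names (`sample_visible_indices`, `mean_pool`, `info_nce`) are assumptions chosen for illustration. The sketch keeps only a random subset of patch tokens, encodes the visible ones, and aligns video and text embeddings with a symmetric contrastive loss instead of reconstructing the masked content.

```python
import math
import random


def sample_visible_indices(num_patches, mask_ratio, rng):
    """Randomly mask patches; return the sorted indices that stay visible."""
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    return sorted(rng.sample(range(num_patches), num_keep))


def mean_pool(tokens):
    """Hypothetical stand-in for an encoder: average the token vectors."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; video i is paired with text i."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


def encode_masked(tokens, mask_ratio, rng):
    """Drop masked tokens first, then encode only the visible ones."""
    keep = sample_visible_indices(len(tokens), mask_ratio, rng)
    return mean_pool([tokens[i] for i in keep])


if __name__ == "__main__":
    rng = random.Random(0)
    batch, dim = 4, 8
    # Toy inputs: 196 patch tokens per video, 16 word tokens per caption.
    videos = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(196)]
              for _ in range(batch)]
    texts = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
             for _ in range(batch)]
    video_embs = [encode_masked(v, 0.75, rng) for v in videos]
    text_embs = [encode_masked(t, 0.5, rng) for t in texts]
    print("contrastive loss:", info_nce(video_embs, text_embs))
```

Because masked tokens are dropped before encoding, the encoder in this sketch processes only a quarter of the patch tokens per video, which is where the efficiency gain of sparse spatial sampling would come from under these assumptions.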
Related papers
- MILES: Visual BERT Pre-training With Injected Language Semantics For Video-text Retrieval (2022)
- Clip2video: Mastering Video-text Retrieval Via Image CLIP (2021)
- Mask To Reconstruct: Cooperative Semantics Completion For Video-text Retrieval (2023)
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022)
- Rap: Redundancy-aware Video-language Pre-training For Text-video Retrieval (2022)
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025)
- M2-RAAP: A Multi-modal Recipe For Advancing Adaptation-based Pre-training Towards Effective And Efficient Zero-shot Video-text Retrieval (2024)
- Litevl: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022)