RaP: Redundancy-aware Video-language Pre-training For Text-video Retrieval
2022 · Xing Wu, Chaochen Gao, Zijia Lin, et al.
Abstract
Video language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers from inter-modal redundancy, which takes two forms: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also likely to miss important frames corresponding to some text portions, resulting in textual redundancy. Inter-modal redundancy leads to a mismatch of video and text information, hindering the model from better learning the shared semantics across modalities. To alleviate it, we propose Redundancy-aware Video-language Pre-training. We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity. Then, we penalize the highly redundant video patches and text tokens through a proposed redundancy-aware contrastive learning. We evaluate our metho
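The redundancy measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes that dissimilarity is one minus cosine similarity and that a patch (or token) counts as more redundant the farther it is from its closest counterpart in the other modality; the abstract does not spell out either detail, so both are assumptions. The function names and toy embeddings are hypothetical, and the redundancy-aware contrastive loss is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def min_dissimilarity(queries, keys):
    """For each query vector, return the smallest (1 - cosine) to any key.

    Assumption: this stands in for the "cross-modal minimum dissimilarity";
    the paper's exact formulation may differ.
    """
    return [min(1.0 - cosine(q, k) for k in keys) for q in queries]

# Toy embeddings (hypothetical values, 2-D for readability).
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # video patch embeddings
tokens = [[1.0, 0.0], [0.0, 1.0]]                # text token embeddings

# A large value means no element of the other modality is close,
# which this sketch treats as high redundancy.
patch_redundancy = min_dissimilarity(patches, tokens)
token_redundancy = min_dissimilarity(tokens, patches)
```

In this toy setup the third patch matches no text token, so it receives the largest score, while both tokens have an exactly matching patch and score zero.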
Related papers
- Discovla: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025) · 6.30
- Masked Contrastive Pre-training For Efficient Video-text Retrieval (2022) · 5.84
- Temporal Perceiving Video-language Pre-training (2023) · 0.00
- Lat: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) · 0.00
- Revitalize Region Feature For Democratizing Video-language Pre-training Of Retrieval (2022) · 2.72
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) · 15.43
- Clip2video: Mastering Video-text Retrieval Via Image CLIP (2021) · 0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93