M2-RAAP: A Multi-modal Recipe For Advancing Adaptation-based Pre-training Towards Effective And Efficient Zero-shot Video-text Retrieval
2024 · Xingning Dong, Zipeng Feng, Chunluan Zhou, et al.
Abstract
We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods face three major issues: noisy data corpora, time-consuming pre-training, and limited performance gains. To this end, we conduct a comprehensive study of four critical steps in video-text pre-training. Specifically, we investigate 1) data filtering and refinement, 2) video input type selection, 3) temporal modeling, and 4) video feature enhancement. We then summarize this empirical study into the M2-RAAP recipe, where our technical contributions lie in 1) the data filtering and text re-writing pipeline resulting in 1M high-quality bilingual video-text pairs, 2) the replacement of video inputs with key-frames to accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to enhance video
Related papers
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) · 9.76
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) · 0.00
- Masked Contrastive Pre-training For Efficient Video-text Retrieval (2022) · 5.84
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) · 0.00
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) · 7.50
- Frame-difference Guided Dynamic Region Perception For CLIP Adaptation In Text-video Retrieval (2025) · 0.00
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93