Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval
2023 · Chaorui Deng, Qi Chen, Pengda Qin, et al.
Abstract
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder la
Related papers
- Prompt-aware Of Frame Sampling For Efficient Text-video Retrieval (2025)
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021)
- Fine-tuned CLIP Models Are Efficient Video Learners (2022)
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021)
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022)
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022)