VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
2022 · Siteng Huang, Biao Gong, Yulin Pan, et al.
Abstract
Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a heavy computational burden from many more parameters, but also causes forgetting of knowledge from the upstream model. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that compared to full fine-tuning, the enhanced VoP achie
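The general idea of prompt tuning described in the abstract (a frozen backbone plus a small set of trainable prompt vectors) can be sketched in plain Python. This is only an illustrative toy, not the authors' implementation: the helper names (`make_prompts`, `prepend_prompts`, `count_params`, `trainable_fraction`), the choice of one prompt block per layer, and all sizes are assumptions made for the example. The frozen parameter count is a hypothetical number picked so that the arithmetic lands on 0.1%.

```python
import random


def make_prompts(num_layers, prompt_len, width, seed=0):
    """Create one block of trainable prompt vectors per layer.

    Per-layer prompts are an assumption of this sketch; the abstract
    does not say where the prompts are inserted.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(width)] for _ in range(prompt_len)]
        for _ in range(num_layers)
    ]


def prepend_prompts(tokens, layer_prompts):
    """Build a layer input: prompt vectors followed by the original tokens."""
    return layer_prompts + tokens


def count_params(prompts):
    """Count scalar values across all prompt vectors."""
    return sum(len(vec) for layer in prompts for vec in layer)


def trainable_fraction(n_prompt, n_frozen):
    """Share of all parameters that are trainable when the backbone is frozen."""
    return n_prompt / (n_prompt + n_frozen)


# Toy sizes, chosen only for illustration.
prompts = make_prompts(num_layers=2, prompt_len=4, width=8)
tokens = [[0.0] * 8 for _ in range(5)]  # 5 stand-in frame or word embeddings

layer_input = prepend_prompts(tokens, prompts[0])
n_prompt = count_params(prompts)  # 2 layers * 4 prompts * 8 dims = 64
n_frozen = 63_936  # hypothetical frozen backbone size, not a CLIP figure
fraction = trainable_fraction(n_prompt, n_frozen)  # 64 / 64000 = 0.001
```

In this toy setup only the 64 prompt values would receive gradient updates, while the stand-in backbone stays fixed, which is how a tuning method can keep its trainable share near 0.1%.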
Related papers
- DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval (2024) 14.35
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (2023) 11.93
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval (2023) 9.76
- PiTL: Cross-Modal Retrieval with Weakly-Supervised Vision-Language Pre-training via Prompting (2023) 7.16
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval (2024) 2.86
- LocVTP: Video-Text Pre-training for Temporal Localization (2022) 11.39
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022) 5.42
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (2022) 18.12