DGL: Dynamic Global-local Prompt Tuning For Text-video Retrieval
2024 · Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, et al.
Abstract
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode frame-level features and failed to extract global-level general video information. (2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts that encourage inter-modal intera
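The abstract is cut off before it says how the prompts are produced, so the following is only a minimal sketch of one way "local-level text and frame prompts" could be generated from a shared latent space. It assumes each modality gets its own linear projection of the same latent vectors. The function names, dimensions, and random initialization are hypothetical and are not taken from the paper.

```python
import random


def make_weights(in_dim, out_dim, rng):
    """Build a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vector):
    """Matrix-vector product: map one latent vector into a modality space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def generate_prompts(shared_latents, w_text, w_frame):
    """Derive text and frame prompts from the SAME latent vectors.

    Assumption: one linear projection per modality. Because both prompt
    sets depend on the shared latents, a change to a latent affects both.
    """
    text_prompts = [project(w_text, z) for z in shared_latents]
    frame_prompts = [project(w_frame, z) for z in shared_latents]
    return text_prompts, frame_prompts


# Hypothetical sizes, chosen only to keep the example small.
LATENT_DIM, TEXT_DIM, FRAME_DIM, NUM_PROMPTS = 8, 4, 6, 3

rng = random.Random(0)
shared_latents = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(NUM_PROMPTS)]
w_text = make_weights(LATENT_DIM, TEXT_DIM, rng)
w_frame = make_weights(LATENT_DIM, FRAME_DIM, rng)

text_prompts, frame_prompts = generate_prompts(shared_latents, w_text, w_frame)
```

In a real prompt-tuning setup the latents and projections would be trainable parameters while the pretrained encoders stay frozen. Here they are fixed random values so the data flow is easy to follow.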
Related papers
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022) | 16.41
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) | 11.93
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024) | 8.75
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) | 15.43
- Retrieval-augmented Dynamic Prompt Tuning For Incomplete Multimodal Learning (2025) | 8.87
- Prompt-aware Of Frame Sampling For Efficient Text-video Retrieval (2025) | 0.95
- Parameter-efficient Prompt Tuning Makes Generalized And Calibrated Neural Text Retrievers (2022) | 5.84
- Fine-grained Retrieval Prompt Tuning (2022) | 10.07