TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval
2023 · Kaibin Tian, Ruixiang Zhao, Hu Hu, et al.
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods dominate. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, which puts their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights: letting them imitate the frame-text relevance estimated by the teacher network. As such, AFA provid
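The two roles the abstract assigns to the AFA weights can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: the function names, the softmax normalization of the scores, and the cross-entropy form of the frame-level loss are all assumptions made for the example, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frame_feats, frame_scores):
    """Combine frame-level features into one video-level vector.

    frame_feats:  list of T frame vectors, each of dimension D.
    frame_scores: list of T unnormalized attention scores
                  (hypothetical output of an AFA-like block).
    Returns (video_vector, attention_weights).
    """
    weights = softmax(frame_scores)
    dim = len(frame_feats[0])
    video = [
        sum(w * feat[d] for w, feat in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return video, weights


def frame_teaching_loss(student_weights, teacher_relevance, eps=1e-12):
    """Cross-entropy between teacher and student frame distributions.

    Assumption: the teacher's frame-text relevance scores are turned
    into a distribution with softmax, and the student's attention
    weights are trained to match it.
    """
    target = softmax(teacher_relevance)
    return -sum(t * math.log(w + eps) for t, w in zip(target, student_weights))


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    video, weights = aggregate_frames(feats, [2.0, 0.0, 0.0])
    print("video vector:", video)
    print("loss vs. agreeing teacher:   ", frame_teaching_loss(weights, [2.0, 0.0, 0.0]))
    print("loss vs. disagreeing teacher:", frame_teaching_loss(weights, [0.0, 0.0, 2.0]))
```

Because the aggregation returns a single vector per video, retrieval in this sketch still reduces to comparing one video vector with one text vector, which is consistent with the abstract's claim that AFA adds no overhead at the retrieval stage.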
Related papers
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) 21.57
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) 15.43
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) 6.02
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022) 15.54