CenterCLIP: Token Clustering For Efficient Text-video Retrieval
2022 · Shuai Zhao, Linchao Zhu, Xiaohan Wang, et al.
Abstract
Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens because consecutive, similar frames in videos are highly redundant. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. Because frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spat
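The segment-then-cluster idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each frame is already a list of token vectors and uses a plain k-medoids loop as a stand-in, since the abstract does not name the clustering algorithm. The names `k_medoids` and `multi_segment_centers`, and the parameters `num_segments`, `k`, `iters`, and `seed`, are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_medoids(tokens, k, iters=10, seed=0):
    """Return sorted indices of up to k medoid ("center") tokens.

    Assumption: plain k-medoids stands in for the clustering step.
    """
    n = len(tokens)
    if k >= n:
        return list(range(n))
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assignment step: attach every token to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: sq_dist(tokens[i], tokens[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # possible only when tokens are exact duplicates
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(sq_dist(tokens[c], tokens[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    # Deduplicate; fewer than k indices can remain if tokens are duplicated.
    return sorted(set(medoids))


def multi_segment_centers(frames, num_segments, k):
    """Cluster tokens per temporal segment and keep only the center tokens.

    frames: list of frames, each a list of token vectors.
    Returns (frame_index, token_index) pairs in their original order.
    """
    seg_len = math.ceil(len(frames) / num_segments)
    kept = []
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        # Flatten the segment while remembering where each token came from.
        flat = [
            (start + f, t, vec)
            for f, frame in enumerate(segment)
            for t, vec in enumerate(frame)
        ]
        centers = k_medoids([vec for _, _, vec in flat], k)
        kept.extend((flat[i][0], flat[i][1]) for i in centers)
    return kept
```

Calling `multi_segment_centers(frames, num_segments=2, k=2)` on a four-frame clip returns at most four positions, two per segment. Returning positions rather than vectors is a design choice of this sketch: it lets the surviving tokens be concatenated in their original frame and token order.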
Related papers
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) · 0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) · 0.00
- TempMe: Video Temporal Token Merging For Efficient Text-video Retrieval (2024) · 2.86
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) · 0.00
- TS2-Net: Token Shift And Selection Transformer For Text-video Retrieval (2022) · 15.51
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024) · 4.52