TempMe: Video Temporal Token Merging For Efficient Text-video Retrieval
2024 · Leqi Shen, Tianxiang Hao, Tao He, et al.
Abstract
Most text-video retrieval methods use text-image pre-trained models such as CLIP as a backbone. These methods process each sampled frame independently with the image encoder, resulting in high computational overhead and limiting practical deployment. To address this, we focus on efficient text-video retrieval by tackling two key challenges: (1) from the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; (2) from the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training-inference efficient text-video retrieval architecture that minimizes trainable parameters and model complexity. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring
Related papers
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022) 15.54
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024) 4.52
- TS2-Net: Token Shift And Selection Transformer For Text-video Retrieval (2022) 15.51
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) 0.95
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022) 16.41
- Madtempo: An Interactive System For Multi-event Temporal Video Retrieval With Query Augmentation (2025) 0.00