TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
2022 · Yuqi Liu, Pengfei Xiong, Luhui Xu, et al.
Abstract
Text-video retrieval is a task of great practical value that has received increasing attention, and learning spatial-temporal video representations is one of its research hotspots. The video encoders in state-of-the-art video retrieval models usually adopt pre-trained vision backbones directly, with the network structure fixed, so they cannot be further improved to produce fine-grained spatial-temporal video representations. In this paper, we propose the Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture that dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples. The token shift module temporally shifts whole token features back and forth across adjacent frames to preserve the complete token representation and capture subtle movements. The token selection module then selects the tokens that contribute most to local spatial semantic
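The token shift idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `token_shift`, the `(T, N, D)` layout (frames, tokens per frame, channels), the index arguments `fwd_idx` and `bwd_idx`, and the zero padding at the first and last frame are all assumptions made for illustration. The point it shows is that a shifted token carries its whole feature vector to the adjacent frame, rather than a slice of channels.

```python
import numpy as np

def token_shift(x, fwd_idx, bwd_idx):
    """Shift whole token vectors across adjacent frames (illustrative sketch).

    x        : array of shape (T, N, D) = frames, tokens per frame, channels
    fwd_idx  : token positions that receive the token of the previous frame
    bwd_idx  : token positions that receive the token of the next frame
    Boundary frames are zero-filled; this padding choice is an assumption.
    """
    out = x.copy()

    # Forward shift: frame t takes the full token vector of frame t-1.
    out[1:, fwd_idx] = x[:-1, fwd_idx]
    out[0, fwd_idx] = 0.0

    # Backward shift: frame t takes the full token vector of frame t+1.
    out[:-1, bwd_idx] = x[1:, bwd_idx]
    out[-1, bwd_idx] = 0.0

    return out

# Toy input: 3 frames, 4 tokens per frame, 2 channels.
x = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)
y = token_shift(x, fwd_idx=[0], bwd_idx=[1])
```

In this toy call, token position 0 is shifted forward in time, position 1 is shifted backward, and positions 2 and 3 are left untouched, so every frame still holds complete (unsplit) token vectors.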
Related papers
- CLIP2TV: Align, Match and Distill for Video-Text Retrieval (2021)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval (2022)
- T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval (2025)
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval (2024)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers (2021)
- Self-supervised Video Retrieval Transformer Network (2021)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization (2022)
- Multi-modal Transformer for Video Retrieval (2020)