T2VLAD: Global-local Sequence Alignment For Text-video Retrieval
2021 · Xiaohan Wang, Linchao Zhu, Yi Yang
Abstract
Text-video retrieval is a challenging task that aims to find relevant video content given natural language descriptions. The key to this problem is measuring text-video similarity in a joint embedding space. However, most existing methods consider only the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal local matching and reasoning, but these complex operations are computationally expensive. In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video feature and the text feature within the same center. This design enables meticulous local comparison and reduces the computational cost of the interaction between each text-video pair. Moreover, a global alignment method is proposed to provide a global cross-modal mea
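The aggregation and per-center comparison described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a NetVLAD-style soft assignment with residual aggregation (suggested only by the "VLAD" in the title), randomly initialised centers instead of learned ones, and a plain average of per-center cosine similarities. The names `aggregate` and `local_similarity` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(features, centers):
    """Aggregate N local features (N, D) onto K shared centers (K, D).

    Assumption: VLAD-style residual aggregation with soft assignment.
    Returns one L2-normalised descriptor per center, shape (K, D).
    """
    assign = softmax(features @ centers.T, axis=1)  # (N, K) soft assignment
    # Sum over n of assign[n, k] * (features[n] - centers[k]) for each k.
    agg = assign.T @ features - assign.sum(axis=0)[:, None] * centers
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    return agg / np.maximum(norms, 1e-12)


def local_similarity(video_feats, text_feats, centers):
    """Average cosine similarity between descriptors that share a center."""
    v = aggregate(video_feats, centers)  # (K, D)
    t = aggregate(text_feats, centers)   # (K, D)
    return float(np.mean(np.sum(v * t, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(4, 8))   # K=4 shared centers (random here)
    video = rng.normal(size=(10, 8))    # 10 local video features
    text = rng.normal(size=(6, 8))      # 6 word-level text features
    print(local_similarity(video, text, centers))
```

Under these assumptions, each text-video pair needs only K descriptor comparisons (one per shared center) rather than a comparison between every word and every video segment, which is the kind of saving the abstract refers to.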
Related papers
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024) · 8.75
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) · 0.95
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) · 0.00
- HANet: Hierarchical Alignment Networks For Video-text Retrieval (2021) · 0.00
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) · 0.00
- TokenBinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024) · 4.52
- LaT: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) · 0.00
- Dual Encoding For Video Retrieval By Text (2020) · 16.05