Video-language Alignment Via Spatio-temporal Graph Transformer
2024 · Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, et al.
Abstract
Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to improve alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within a video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module (dubbed STGT) to uniformly learn spatial and temporal contexts for video-language alignment pre-training. Specifically, STGT combines spatio-temporal graph structure information with attention in the transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, improving video-text alignment precision to benefit downstream tasks. In addition, we propose a self-similarity
Related papers
- Tagging Before Alignment: Integrating Multi-modal Tags For Video-text Retrieval (2023) · 10.74
- A Multi-level Alignment Training Scheme For Video-and-language Grounding (2022) · 3.58
- Multi-modal Transformer For Video Retrieval (2020) · 19.47
- LaT: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) · 0.00
- Temporal Perceiving Video-language Pre-training (2023) · 0.00
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) · 0.95
- LocVTP: Video-text Pre-training For Temporal Localization (2022) · 11.39
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) · 0.00