Tagging Before Alignment: Integrating Multi-modal Tags For Video-text Retrieval
2023 · Yizhen Chen, Jie Wang, Lijian Lin, et al.
Abstract
Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information in videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information explicitly by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are used to extract information from multiple modalities, such as object, person, motion, and audio. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction b
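To make the idea of tags acting as anchors more concrete, here is a minimal sketch in plain Python. It is not the TABLE network: the abstract does not specify how the tag-guiding cross-modal encoder is built, so this example assumes a single attention step in which each tag embedding queries the frame features, followed by mean pooling and cosine similarity against a text embedding. The function names (`tag_guided_fusion`, `retrieval_score`) and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_guided_fusion(frame_feats, tag_feats):
    """Assumed fusion step: every tag embedding attends over the frame
    features (tags are the queries), and the attended vectors are
    mean-pooled into one unit-length video embedding."""
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    attended = []
    for tag in tag_feats:
        weights = softmax([dot(tag, frame) / scale for frame in frame_feats])
        attended.append(
            [sum(w * frame[i] for w, frame in zip(weights, frame_feats))
             for i in range(dim)]
        )
    pooled = [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]
    return l2_normalize(pooled)


def retrieval_score(video_vec, text_vec):
    """Cosine similarity between a video embedding and a text embedding."""
    return dot(l2_normalize(video_vec), l2_normalize(text_vec))


# Toy example: two frames, one tag that points toward the first frame.
frames = [[1.0, 0.0], [0.0, 1.0]]
tags = [[4.0, 0.0]]
video = tag_guided_fusion(frames, tags)
matching_text = [1.0, 0.0]
unrelated_text = [0.0, 1.0]
print(retrieval_score(video, matching_text) > retrieval_score(video, unrelated_text))
```

In this toy setup the tag pulls the pooled video embedding toward the frame it agrees with, so the caption aligned with that frame scores higher than the unrelated one. A real system would use learned encoders and multi-layer attention in place of these hand-written vectors.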
Related papers
- Hanet: Hierarchical Alignment Networks For Video-text Retrieval (2021) · 0.00
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) · 0.00
- T2vparser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) · 0.95
- Video-language Alignment Via Spatio-temporal Graph Transformer (2024) · 0.00
- A Multi-level Alignment Training Scheme For Video-and-language Grounding (2022) · 3.58
- Multilevel Language And Vision Integration For Text-to-clip Retrieval (2018) · 17.67
- Multi-modal Transformer For Video Retrieval (2020) · 19.47
- Tokenbinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024) · 4.52