Boosting Video-text Retrieval With Explicit High-level Semantics
2022 · Haoran Wang, Di Xu, Dongliang He, et al.
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual moda
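To make the retrieval task in the abstract concrete, here is a minimal sketch of bidirectional retrieval as similarity ranking. It assumes that videos and texts have already been embedded into a shared vector space and that cosine similarity is the matching score. The embedding values, the `cosine` and `rank` helpers, and the choice of cosine similarity are all illustrative assumptions, not details taken from HiSE.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy, hand-made embeddings (hypothetical values, not outputs of any model).
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.8, 0.0]]

# Text-to-video: rank all videos for the first text query.
text_to_video = rank(text_embs[0], video_embs)

# Video-to-text: rank all texts for the second video query.
video_to_text = rank(video_embs[1], text_embs)
```

The same `rank` function serves both directions because only the roles of query and candidate set are swapped; a real system would replace the toy vectors with encoder outputs.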
Related papers
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026) 1.24
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023) 14.33
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024) 0.00
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) 0.00
- HANet: Hierarchical Alignment Networks For Video-text Retrieval (2021) 0.00
- Generative Recall, Dense Reranking: Learning Multi-view Semantic Ids For Efficient Text-to-video Retrieval (2026) 0.00
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024) 0.00