HANet: Hierarchical Alignment Networks for Video-Text Retrieval
2021 · Peng Wu, Xiangteng He, Mingqian Tang, et al.
Abstract
Video-text retrieval is an important yet challenging task in vision-language understanding, which aims to learn a joint embedding space where related video and text instances are close to each other. Most current works simply measure the video-text similarity based on video-level and text-level embeddings. However, the neglect of more fine-grained or local information causes the problem of insufficient representation. Some works exploit the local details by disentangling sentences, but overlook the corresponding videos, causing the asymmetry of video-text representation. To address the above limitations, we propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching. Specifically, we first decompose video and text into three semantic levels, namely event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we naturally construct hierarchical representations in the individual-local-global mann
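The idea of scoring a video-text pair at the event, action, and entity levels can be illustrated with a toy example. The following is a minimal sketch, not the HANet model: the per-level embeddings are hand-written vectors, and averaging the three cosine similarities is an assumption made here for illustration. The function names `cosine`, `hierarchical_similarity`, and `rank_videos` are hypothetical.

```python
import math

# The three semantic levels named in the abstract.
LEVELS = ("event", "action", "entity")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hierarchical_similarity(video, text):
    """Average the per-level cosine similarities.

    Assumption: plain averaging is used only for illustration; the
    abstract does not specify how the levels are combined.
    """
    return sum(cosine(video[lv], text[lv]) for lv in LEVELS) / len(LEVELS)


def rank_videos(text, videos):
    """Return video names sorted from most to least similar to the text."""
    scores = {name: hierarchical_similarity(v, text) for name, v in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical values, one vector per level).
text = {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]}
videos = {
    "match": {"event": [2.0, 0.0], "action": [0.0, 3.0], "entity": [1.0, 1.0]},
    "mismatch": {"event": [0.0, 1.0], "action": [1.0, 0.0], "entity": [1.0, -1.0]},
}

ranking = rank_videos(text, videos)
```

In this toy setup the video whose vectors point in the same direction as the text at every level ranks first. A real system would learn the embeddings and would likely combine the levels differently.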
Related papers
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026) 1.24
- Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning (2020) 18.27
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022) 7.81
- Text-based Localization Of Moments In A Video Corpus (2020) 10.35
- HGAN: Hierarchical Graph Alignment Network For Image-text Retrieval (2022) 11.93
- Tagging Before Alignment: Integrating Multi-modal Tags For Video-text Retrieval (2023) 10.74
- Hyperbolic Hierarchical Alignment Reasoning Network For Text-3d Retrieval (2025) 1.81
- Text-video Retrieval Via Variational Multi-modal Hypergraph Networks (2024) 0.00