Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval
2026 Β· Zequn Xie, Boyun Zhang, Yuxiao Lin, et al.
Abstract
Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at https://github.com/b
Authors
(none)
Tags
Stats
Related papers
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026)0.00
- Boosting Video-text Retrieval With Explicit High-level Semantics (2022)7.50
- Hanet: Hierarchical Alignment Networks For Video-text Retrieval (2021)0.00
- Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning (2020)18.27
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022)7.81
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023)15.46
- Hivlp: Hierarchical Vision-language Pre-training For Fast Image-text Retrieval (2022)0.00
- Multilevel Language And Vision Integration For Text-to-clip Retrieval (2018)17.67