Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations
2022 · Jie Jiang, Shaobo Min, Weijie Kong, et al.
Abstract
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations for frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With mul
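The frame-clip-video hierarchy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head self-attention with no learned projections, and it swaps the adaptive clustering for plain mean pooling over fixed-size groups of consecutive frames. The names `self_attention`, `mean_pool`, `build_video_levels`, and `clip_size` are hypothetical, and the text side (word-phrase-sentence) is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(frames):
    """Scaled dot-product self-attention over frame features.

    Assumption: queries, keys, and values are the raw frame features
    (no learned projection matrices), which keeps the sketch small.
    """
    dim = len(frames[0])
    attended = []
    for query in frames:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in frames])
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, frames)) for i in range(dim)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def build_video_levels(frames, clip_size):
    """Return (frame-level, clip-level, video-level) representations.

    Assumption: clips are fixed-size groups of consecutive frames pooled by
    their mean, a stand-in for the adaptive clustering named in the abstract.
    """
    frame_level = self_attention(frames)
    clip_level = [
        mean_pool(frame_level[i:i + clip_size])
        for i in range(0, len(frame_level), clip_size)
    ]
    video_level = mean_pool(clip_level)
    return frame_level, clip_level, video_level


# Toy input: four 2-dimensional frame features, grouped two frames per clip.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
frame_level, clip_level, video_level = build_video_levels(frames, clip_size=2)
```

Each level could then be compared with the text representation of matching granularity; how those comparisons are combined is not covered by this sketch.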
Related papers
- Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning (2020)
- HANet: Hierarchical Alignment Networks For Video-text Retrieval (2021)
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026)
- HiT: Hierarchical Transformer With Momentum Contrast For Video-text Retrieval (2021)
- Multi-modal Transformer For Video Retrieval (2020)
- TC-MGC: Text-conditioned Multi-grained Contrastive Learning For Text-video Retrieval (2025)
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022)
- Disentangled Representation Learning For Text-video Retrieval (2022)