Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning
2020 · Shizhe Chen, Yida Zhao, Qin Jin, et al.
Abstract
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach to this problem is to learn a joint embedding space in which cross-modal similarities are measured. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. Specifically, the model disentangles texts into a hierarchical semantic graph with three levels (events, actions, entities) and relationships across levels. Attention-based graph reasoning is used to generate hierarchical textual embeddings, which guide the learning of diverse and hierarchical video representations. The HGR model aggregates matchings from different video-text levels to capture both global a
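As a rough illustration of the level-wise aggregation the abstract describes, the sketch below scores a text against a video at each of the three levels and averages the results. It is a minimal sketch under stated assumptions, not the authors' implementation: the single vector per level, the use of cosine similarity, the unweighted mean, and the names `cosine`, `aggregate_similarity` and `rank_videos` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical level names taken from the abstract's three-level hierarchy.
LEVELS = ("event", "action", "entity")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate_similarity(text_emb, video_emb):
    """Score one text against one video.

    Assumption: each side is a dict holding one embedding per level.
    Returns the unweighted mean of the per-level similarities and the
    per-level scores themselves.
    """
    per_level = {lvl: cosine(text_emb[lvl], video_emb[lvl]) for lvl in LEVELS}
    overall = sum(per_level.values()) / len(LEVELS)
    return overall, per_level


def rank_videos(text_emb, videos):
    """Return (video_id, score) pairs sorted from best to worst match."""
    scored = [
        (video_id, aggregate_similarity(text_emb, emb)[0])
        for video_id, emb in videos.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings, invented purely for this example.
query = {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]}
videos = {
    "video_a": {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]},
    "video_b": {"event": [0.0, 1.0], "action": [1.0, 0.0], "entity": [1.0, 1.0]},
}
ranking = rank_videos(query, videos)
```

In this toy setup `video_a` agrees with the query at every level, while `video_b` agrees only at the entity level, so a purely global score could miss the difference that the per-level breakdown exposes.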
Related papers
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022) 7.81
- Text-video Retrieval Via Variational Multi-modal Hypergraph Networks (2024) 0.00
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026) 1.24
- Hanet: Hierarchical Alignment Networks For Video-text Retrieval (2021) 0.00
- Graph-based Hierarchical Relevance Matching Signals For Ad-hoc Retrieval (2021) 8.60
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025) 0.00
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57
- Multi-modal Reasoning Graph For Scene-text Based Fine-grained Image Classification And Retrieval (2020) 11.29