HiT: Hierarchical Transformer With Momentum Contrast For Video-text Retrieval
2021 · Song Liu, Haoqi Fan, Shengsheng Qian, et al.
Abstract
Video-Text Retrieval has been a hot research topic with the growth of multimedia data on the internet. Transformer for video-text learning has attracted increasing attention due to its promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) Exploitation of the transformer architecture where different layers have different feature characteristics is limited; 2) End-to-end training mechanism limits negative sample interactions in a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching in both feature-level and semantic-level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on-the-fly, which contributes to the generation of more precise and discrimi
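The momentum-contrast idea mentioned in the abstract can be illustrated with a generic MoCo-style sketch. This is not the authors' implementation: the helper names (`momentum_update`, `info_nce`), the momentum coefficient, the temperature, the queue size, and the toy 2-D embeddings are all assumptions chosen for illustration. It shows the two ingredients MoCo is known for: a key encoder updated as an exponential moving average of the query encoder, and a fixed-size queue of past keys that supplies negatives beyond the current mini-batch.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style EMA update: key <- m * key + (1 - m) * query.

    The value of m is an illustrative assumption, not a setting from the paper.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query, one positive key, and a list of negative keys.

    The temperature is an illustrative assumption.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum - logits[0]


# Hypothetical queue of text-side key embeddings from earlier batches.
# maxlen makes the oldest keys fall out automatically when the queue is full.
text_queue = deque(maxlen=4)
for emb in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_queue.append(emb)

video_query = [1.0, 0.1]        # toy video embedding from the query encoder
matching_text_key = [1.0, 0.0]  # toy text embedding from the key encoder

# Contrast the video query against its paired text and the queued negatives.
loss = info_nce(video_query, matching_text_key, list(text_queue))

# After the step, the current key joins the queue for use as a future negative.
text_queue.append(matching_text_key)
```

In a cross-modal setting the same pattern would be applied in both directions (video queries against text keys, and text queries against video keys), each direction with its own queue. How HiT combines this with its feature-level and semantic-level matching is not specified in the text above.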
Related papers
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022) 7.81
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) 8.82
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- ConTra: (Con)text (Tra)nsformer For Cross-modal Video Retrieval (2022) 2.26
- LaT: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) 0.00
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022) 0.00