Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss
2021 · Xing Cheng, Hezheng Lin, Xiangyu Wu, et al.
Abstract
Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend, one that outperforms previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two forms of heterogeneity. CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction module and the feature alignment module. DSL is proposed to avoid the one-way optimum match that occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL
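The abstract text here is cut off before it says how DSL uses the batch prior, so the following is only a minimal sketch under a stated assumption: that the prior for a text-video pair is a softmax taken along the opposite axis of the similarity matrix, and that it is multiplied into the similarity before the usual contrastive cross-entropy. The function names, the `temperature` parameter, and the toy similarity values are all hypothetical and are not taken from the paper.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v * temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def dual_softmax_revise(sim, temperature=1.0):
    """Scale each similarity by a prior computed along the opposite axis.

    sim[i][j] is the similarity of text i to video j. The assumed prior for
    entry (i, j) is the softmax over column j: how strongly video j prefers
    text i among all texts in the batch.
    """
    col_prior = transpose([softmax(col, temperature) for col in transpose(sim)])
    return [[s * p for s, p in zip(row, prior_row)]
            for row, prior_row in zip(sim, col_prior)]

def contrastive_loss(sim):
    """Mean cross-entropy with the matched pairs on the diagonal."""
    losses = [-math.log(softmax(row)[i]) for i, row in enumerate(sim)]
    return sum(losses) / len(losses)

def dual_softmax_loss(sim, temperature=1.0):
    """Symmetric loss: text-to-video plus video-to-text, both revised."""
    t2v = contrastive_loss(dual_softmax_revise(sim, temperature))
    v2t = contrastive_loss(dual_softmax_revise(transpose(sim), temperature))
    return t2v + v2t

# Toy batch: both texts pick video 0 when each row is read on its own,
# which is a one-way match (video 1 clearly prefers text 1).
toy_sim = [[0.90, 0.30],
           [0.85, 0.80]]
raw_choice = [argmax(row) for row in toy_sim]
revised_choice = [argmax(row) for row in dual_softmax_revise(toy_sim, temperature=20.0)]
```

In this toy batch, the raw row-wise choice assigns video 0 to both texts, while the revised matrix assigns video 1 to text 1 because video 1's own preference is folded into the score. The temperature of 20 is chosen only to make the effect visible on two items.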
Related papers
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) 7.50
- Video Corpus Moment Retrieval With Contrastive Learning (2021) 14.35
- DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025) 6.30