COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval
2022 · Haoyu Lu, Nanyi Fei, Yuqi Huo, et al.
Abstract
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pre-training model termed COTS for image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interactions in our COTS: (1) Token-level interaction - a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task
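The instance-level alignment mentioned in the abstract can be illustrated with a generic symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. The following is a minimal sketch, not the authors' implementation: the function names, the temperature of 0.07, and the momentum coefficient are assumptions chosen for illustration, and the negative-sample queue typically used in momentum contrastive learning is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


def momentum_update(online_params, momentum_params, m=0.99):
    """Exponential moving average of parameters, as used for a momentum encoder.

    Both arguments are lists of arrays with matching shapes; m is hypothetical.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(online_params, momentum_params)]
```

Because each stream encodes its input independently, candidate embeddings can be computed once and stored, so retrieval at inference reduces to the single matrix product `img @ txt.T` seen above. This is the efficiency property the abstract attributes to two-stream models.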
Related papers
- Contrastive Cross-modal Knowledge Sharing Pre-training For Vision-language Representation Learning And Retrieval (2022)
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)
- Curriculum Learning For Data-efficient Vision-language Alignment (2022)
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021)
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023)
- EfficientCLIP: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021)