Contrastive Cross-modal Knowledge Sharing Pre-training For Vision-language Representation Learning And Retrieval
2022 · Keyu Wen, Zhenshan Tan, Qingrong Cheng, et al.
Abstract
Recently, cross-modal pre-training has become a hotspot because of its wide application in downstream research, including retrieval, captioning, and question answering. However, existing methods adopt a one-stream pre-training model to explore a unified vision-language representation for cross-modal retrieval, which easily suffers from a computation explosion. Moreover, although conventional double-stream structures are quite efficient, they still lack vital cross-modal interactions, resulting in low performance. Motivated by these challenges, we put forward Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) to learn joint text-image representations. Structurally, COOKIE adopts the traditional double-stream structure because of its acceptable time consumption. To overcome the inherent defects of the double-stream structure mentioned above, we elaborately design two effective modules. Concretely, the first module is a weig
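The efficiency argument for double-stream models can be made concrete with a small sketch. It assumes a generic symmetric contrastive (InfoNCE-style) objective over independently encoded image and text embeddings; the abstract as shown here does not specify COOKIE's actual loss or its two modules, so the function names, the temperature value, and the random stand-in embeddings below are all illustrative assumptions rather than the authors' method.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Assumption: row i of image_emb and row i of text_emb form a matched pair.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    sim = img @ txt.T / temperature          # (N, N) image-to-text logits
    idx = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i), sim

def retrieve(sim, k=1):
    # For each image (row), return the indices of the k most similar texts.
    return np.argsort(-sim, axis=1)[:, :k]

# Stand-in embeddings (hypothetical): in a real double-stream model these
# would come from separate image and text encoders.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
image_emb = text_emb + 0.01 * rng.normal(size=(4, 8))

loss, sim = symmetric_contrastive_loss(image_emb, text_emb)
top1 = retrieve(sim, k=1)
```

Because each modality is encoded on its own, gallery embeddings can be computed once and cached, and retrieval reduces to one matrix product. A one-stream model instead has to run a joint forward pass for every image-text pair it scores, which is the cost the abstract refers to.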
Related papers
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- Cross-view Language Modeling: Towards Unified Cross-lingual Cross-modal Pre-training (2022) 8.09
- Improving The Consistency In Cross-lingual Cross-modal Retrieval With 1-to-k Contrastive Learning (2024) 5.84
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022) 0.00
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023) 8.60
- EfficientCLIP: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021) 0.00