A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval
2022 · Zhixiong Zeng, Wenji Mao
Abstract
Cross-modal retrieval (CMR) is an important research topic spanning multimodal computing and information retrieval: it takes data of one type as the query to retrieve relevant data of another type, and it has been widely used in real-world applications. Recently, vision-language pre-trained models, represented by CLIP, have demonstrated their superiority in learning visual and textual representations and have achieved impressive performance on various vision- and language-related tasks. Although CLIP and earlier pre-trained models have substantially improved unsupervised CMR, their performance and impact on supervised CMR have rarely been explored, owing to the lack of a common representation for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study. We evaluate its performance and impact on the supervised CMR, and attempt t
Related papers
- Scene-centric Vs. Object-centric Image-text Cross-modal Retrieval: A Reproducibility Study (2023) 5.24
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024) 0.00
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) 6.02
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- EfficientCLIP: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021) 0.00
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026) 0.00
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023) 8.60
- Cross-view Language Modeling: Towards Unified Cross-lingual Cross-modal Pre-training (2022) 8.09