Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment
2024 · Konstantin Schall, Kai Uwe Barthel, Nico Hezel, et al.
Abstract
Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to th
Related papers
- Finetuning CLIP To Reason About Pairwise Differences (2024) 0.00
- Contextclip: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) 5.84
- Linear Alignment Of Vision-language Models For Image Captioning (2023) 0.00
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023) 0.00
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024) 0.00
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2024) 0.00
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025) 3.64
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026) 3.06