CalibCLIP: Contextual Calibration Of Dominant Semantics For Text-driven Image Retrieval
2025 · Bin Kang, Bin Chen, Junjie Wang, et al.
Abstract
Existing Visual Language Models (VLMs) suffer from structural limitations, where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among
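
The idea of suppressing dominant tokens before aggregation can be illustrated with a toy example. The sketch below is not the authors' CVE or DCC: the abstract does not say how dominant tokens are identified or how strongly they are suppressed. It assumes, purely for illustration, that the dominant tokens are the `top_k` tokens with the largest aggregation weight and that suppression multiplies their weight by a fixed `scale` before renormalizing. The names `suppress_dominant`, `aggregate`, `top_k`, and `scale` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_dominant(attn, top_k=1, scale=0.1):
    """Down-weight the top_k largest weights, then renormalize to sum to 1.

    ASSUMPTION: "dominant" means highest aggregation weight, and suppression
    is a fixed multiplicative factor. The source specifies neither.
    """
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    dominant = order[:top_k]
    damped = [a * scale if i in dominant else a for i, a in enumerate(attn)]
    total = sum(damped)
    return [d / total for d in damped], dominant


def aggregate(weights, tokens):
    """Weighted sum of token feature vectors (the pooled representation)."""
    dim = len(tokens[0])
    return [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)]


# Toy data: token 0 absorbs most of the weight and hides tokens 1 and 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attn = softmax([4.0, 1.0, 1.0])

calibrated, dominant = suppress_dominant(attn, top_k=1, scale=0.1)
pooled_before = aggregate(attn, tokens)
pooled_after = aggregate(calibrated, tokens)

print("dominant token indices:", dominant)
print("pooled before:", pooled_before)
print("pooled after: ", pooled_after)
```

After calibration the pooled vector draws more on the second feature dimension, which in this toy setup is carried only by the non-dominant tokens. Because the method is described as training-free, the sketch only reweights existing features and learns no parameters.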
Related papers
- Finetuning CLIP To Reason About Pairwise Differences (2024) 0.00
- \(\beta\)-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment (2025) 2.16
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) 0.00
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) 1.69
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) 12.11
- ContextBLIP: Doubly Contextual Alignment For Contrastive Image Retrieval From Linguistically Complex Descriptions (2024) 0.00
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) 5.84
- VL-CLIP: Enhancing Multimodal Recommendations Via Visual Grounding And LLM-augmented CLIP Embeddings (2025) 2.26