Seeing What Matters: Empowering CLIP With Patch Generation-to-selection
2025 · Gensheng Pei, Tao Chen, Yujia Wang, et al.
Abstract
The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sob
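As a rough illustration of the patch-masking idea the abstract describes, the sketch below splits an image into patches, pre-selects a small candidate set as potential mask regions, and drops those patches before they would reach an image encoder. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`patchify`, `preselect_candidates`, `drop_patches`), the uniform random pre-selection, and the `candidate_ratio` parameter are all hypothetical. The refinement step that follows pre-selection is cut off in the abstract text above, so it is not sketched here.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must divide by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (N, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def preselect_candidates(num_patches, candidate_ratio, rng):
    """Pick a candidate set of patch indices as potential mask regions.

    Assumption: candidates are drawn uniformly at random without replacement.
    """
    k = int(round(num_patches * candidate_ratio))
    return np.sort(rng.choice(num_patches, size=k, replace=False))


def drop_patches(patches, masked_idx):
    """Remove the masked patches; return the kept patches and their indices."""
    keep = np.ones(len(patches), dtype=bool)
    keep[masked_idx] = False
    return patches[keep], np.flatnonzero(keep)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3))       # stand-in for a real training image
    patches = patchify(image, patch=16)     # 14 x 14 grid of 16 x 16 patches
    candidates = preselect_candidates(len(patches), candidate_ratio=0.5, rng=rng)
    kept, kept_idx = drop_patches(patches, candidates)
    print(patches.shape, candidates.shape, kept.shape)
```

Only the kept patches would be passed on, which is where the efficiency gain of masking comes from: the encoder processes fewer tokens per image. How the candidate set is then refined to preserve semantic content is the part specific to the paper.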
Related papers
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024)
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026)
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023)
- Enhancing Image Retrieval : A Comprehensive Study On Photo Search Using The CLIP Mode (2024)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)