SuperCLIP: CLIP with Simple Classification Supervision
2025 · Weiheng Zhao, Zilong Huang, Jiashi Feng, et al.
Abstract
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot cl
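The idea in the abstract, a contrastive objective plus a classification term computed from a single linear layer on the image embedding, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The target construction (a uniform distribution over the caption's token ids), the softmax cross-entropy form, the `cls_weight` factor, and all function names are assumptions made here for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Symmetric InfoNCE: the k-th image and k-th text form the positive pair.
    n = len(img_embs)
    sims = [[dot(i, t) / temperature for t in txt_embs] for i in img_embs]
    i2t = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax([sims[r][k] for r in range(n)])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def classification_loss(img_embs, caption_token_ids, weight, bias):
    # Hypothetical linear head: image embedding -> one logit per vocabulary entry.
    # Assumed target: uniform over the distinct token ids in the (non-empty) caption.
    total = 0.0
    for emb, ids in zip(img_embs, caption_token_ids):
        logits = [dot(row, emb) + b for row, b in zip(weight, bias)]
        logp = log_softmax(logits)
        targets = set(ids)
        total += -sum(logp[t] for t in targets) / len(targets)
    return total / len(img_embs)

def combined_loss(img_embs, txt_embs, caption_token_ids, weight, bias,
                  temperature=0.07, cls_weight=1.0):
    # Contrastive term plus an (assumed) weighted classification term.
    return (contrastive_loss(img_embs, txt_embs, temperature)
            + cls_weight * classification_loss(img_embs, caption_token_ids, weight, bias))
```

In this sketch the only parameters added to a CLIP-style model are `weight` and `bias`, which mirrors the abstract's description of a lightweight linear layer on the vision encoder. The text encoder's tokens serve as labels, so no extra annotation is involved.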
Related papers
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- Goldiclip: The Goldilocks Approach For Balancing Explicit Supervision For Language-image Pretraining (2026) 0.00
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025) 5.75
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) 5.84
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026) 0.00
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024) 14.90
- \(\beta\)-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment (2025) 2.16
- RECLIP: Resource-efficient CLIP By Training With Small Images (2023) 0.00