CLIP Is Shortsighted: Paying Attention Beyond The First Sentence
2026 · Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, et al.
Abstract
CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during tra
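The summary-removal idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `drop_summary_sentence` is hypothetical, and the regex-based sentence splitting is an assumption made for illustration (a real pipeline would likely need a more robust sentence segmenter).

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This naive rule mis-splits abbreviations such as "e.g. this".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def drop_summary_sentence(caption: str) -> str:
    """Return the caption without its opening sentence.

    Captions with a single sentence are returned unchanged (stripped),
    so no training pair ends up with an empty text.
    """
    text = caption.strip()
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text) if s]
    if len(sentences) < 2:
        return text
    return " ".join(sentences[1:])


if __name__ == "__main__":
    long_caption = (
        "A dog plays in a park. "
        "The dog is a brown terrier with a red collar. "
        "Trees line the background."
    )
    print(drop_summary_sentence(long_caption))
```

In a fine-tuning loop, a helper like this would be applied to each long caption before tokenization, so that the text encoder cannot rely on the opening summary alone to match the image.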
Related papers
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024)
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023)
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024)
- SuperCLIP: CLIP With Simple Classification Supervision (2025)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)