CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision
2021 · Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, et al.
Abstract
We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is al
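The objective described in the abstract (one negative image-text pair per positive, optimized through a lower bound on mutual information) can be sketched roughly as follows. The abstract does not name the bound, so this sketch assumes a Jensen-Shannon-style estimator; the dot-product critic, the choice of negatives by shifting the batch by one position, and the function names `jsd_mi_lower_bound` and `contrastive_loss` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def jsd_mi_lower_bound(img, txt):
    """Jensen-Shannon-style lower bound with one negative caption per image.

    img, txt: arrays of shape (batch, dim); row i of each is a matched pair.
    Assumption: the critic is a plain dot product between the two features.
    """
    # Critic scores on the matched (positive) pairs.
    pos_scores = np.sum(img * txt, axis=1)
    # Assumption: pair each image with its neighbor's caption, giving
    # exactly one mismatched (negative) caption per image.
    neg_txt = np.roll(txt, shift=1, axis=0)
    neg_scores = np.sum(img * neg_txt, axis=1)
    # E_pos[-softplus(-T)] - E_neg[softplus(T)]
    return np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores))


def contrastive_loss(img, txt):
    # Maximizing the bound is the same as minimizing its negative.
    return -jsd_mi_lower_bound(img, txt)
```

In this form the cost per batch grows linearly with batch size, because each image is scored against two captions rather than against every caption in the batch.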
Related papers
- SuperCLIP: CLIP With Simple Classification Supervision (2025) 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Linear Alignment Of Vision-language Models For Image Captioning (2023) 0.00
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022) 5.84
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024) 0.00