LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models
2023 · Ying Nie, Wei He, Kai Han, et al.
Abstract
Vision-language pre-training methods such as CLIP have shown promising performance on various downstream tasks, including zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. First, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Second, a token-level alignment objective based on relaxed bipartite matching is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra objective of masked languag
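The first objective in the abstract, softening the labels of negative samples progressively, can be illustrated with a minimal sketch. The abstract does not give the softening rule, so everything specific here is an assumption made for illustration: the linear schedule, the `max_soft` cap, the even split of the softened mass across negatives, and the function names `softened_targets` and `soft_contrastive_loss`. Only the image-to-text direction is shown.

```python
import math


def softened_targets(batch_size, step, total_steps, max_soft=0.2):
    """Build soft targets for a batch of paired image-text samples.

    Row i keeps 1 - alpha on the paired text i and spreads alpha evenly
    over the other (negative) texts. alpha grows linearly with the
    training step up to max_soft (an assumed schedule, not the paper's).
    """
    if batch_size < 2:
        return [[1.0] * batch_size for _ in range(batch_size)]
    alpha = max_soft * min(step / total_steps, 1.0)
    off_diag = alpha / (batch_size - 1)
    return [
        [1.0 - alpha if i == j else off_diag for j in range(batch_size)]
        for i in range(batch_size)
    ]


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def soft_contrastive_loss(logits, targets):
    """Mean cross-entropy between softmax(logits) rows and soft target rows."""
    total = 0.0
    for row, target in zip(logits, targets):
        log_probs = log_softmax(row)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(logits)


# Toy image-to-text similarity logits for a batch of two pairs.
logits = [[2.0, 0.0], [0.0, 2.0]]
hard_loss = soft_contrastive_loss(logits, softened_targets(2, 0, 10))
soft_loss = soft_contrastive_loss(logits, softened_targets(2, 10, 10))
```

At step 0 the targets are one-hot, so the loss reduces to the usual instance-level contrastive cross-entropy. As training proceeds, a small share of the target mass moves to the negatives, which penalizes the model less for assigning similarity to a text that is not the labeled pair.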
Related papers
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- uCLIP: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- CLIP-PING: Boosting Lightweight Vision-language Models With Proximus Intrinsic Neighbors Guidance (2024)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- LiteEmbed: Adapting CLIP To Rare Classes (2026)