CLIMP: Contrastive Language-Image Mamba Pretraining
2026 · Nimrod Shabtay, Itamar Zimerman, Eli Schwartz, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers, whose attention mechanism is susceptible to spurious correlations and scales quadratically with resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model, which replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language, with VMamba capturing visual spatial inductive biases. This reduces reliance on spurious correlations and produces an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x training resolution while using 5x less memory and 1.8x fewer FLOPs. The autoregressive text encoder further overcomes CLIP's fixed cont
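To make the abstract's scaling argument concrete, the following is a minimal sketch of how token count and per-layer cost grow with image size. It is not the authors' FLOP accounting: the 16-pixel patch size, the 224-pixel base resolution, and the helper names are illustrative assumptions, and constant factors are ignored. It compares only the number of pairwise attention interactions (quadratic in tokens) with the number of steps in a linear-time sequential scan.

```python
def num_patches(side_px: int, patch_px: int = 16) -> int:
    """Tokens produced by splitting a square image into square patches.

    patch_px=16 is an illustrative assumption, not a value from the paper.
    """
    return (side_px // patch_px) ** 2


def attention_pairs(n_tokens: int) -> int:
    """Pairwise interactions in full self-attention: quadratic in tokens."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int) -> int:
    """Steps in a linear-time sequential scan: one per token."""
    return n_tokens


base_tokens = num_patches(224)  # assumed base resolution

for side in (224, 448, 896):
    n = num_patches(side)
    attn_growth = attention_pairs(n) // attention_pairs(base_tokens)
    scan_growth = scan_steps(n) // scan_steps(base_tokens)
    print(f"{side}px: {n} tokens, attention x{attn_growth}, scan x{scan_growth}")
```

Under these assumptions, quadrupling the side length multiplies the token count by 16, the attention interactions by 256, and the scan steps by 16. Real memory and FLOP ratios depend on implementation details that this sketch does not model.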
Related papers
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training (2024) 0.00
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- M2-Encoder: Advancing Bilingual Image-text Understanding By Large-scale Efficient Pretraining (2024) 0.00
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) 18.12
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- uCLIP: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025) 0.00
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025) 4.52