Robustness In Both Domains: CLIP Needs A Robust Text Encoder
2025 · Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, et al.
Abstract
Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reco
Related papers
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) 6.34
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- Finetuning CLIP To Reason About Pairwise Differences (2024) 0.00
- Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics (2025) 3.58
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025) 6.34
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024) 14.90
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) 18.12