CLIP The Bias: How Useful Is Balancing Data In Multimodal Learning?
2024 · Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, et al.
Abstract
We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem
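The distinction between representation bias (a first-order statistic) and association bias (a second-order statistic) can be made concrete with a small sketch. The code below is not the M4 algorithm from the paper; it is a simple stratified reweighting, swapped in for illustration, and it assumes a binary sensitive attribute `s`, a binary label `y`, and a target of 0.5 for the attribute's mean. All function names and the toy data are hypothetical.

```python
from collections import Counter


def weighted_mean(values, weights):
    """Weighted average of a list of numbers."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def representation_bias(s, weights, target=0.5):
    """First-order statistic: gap between the attribute's mean and a target."""
    return weighted_mean(s, weights) - target


def association_bias(s, y, weights):
    """Second-order statistic: weighted covariance between attribute and label."""
    mean_s = weighted_mean(s, weights)
    mean_y = weighted_mean(y, weights)
    products = [(a - mean_s) * (b - mean_y) for a, b in zip(s, y)]
    return weighted_mean(products, weights)


def stratified_weights(s, y):
    """Illustrative reweighting (NOT M4): give every (s, y) cell equal total mass.

    This zeroes both statistics only when all four cells are non-empty.
    """
    cell_counts = Counter(zip(s, y))
    return [1.0 / cell_counts[(a, b)] for a, b in zip(s, y)]


# Toy data (hypothetical): the attribute is over-represented and
# co-occurs with the positive label more often than chance.
s = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

uniform = [1.0] * len(s)
balanced = stratified_weights(s, y)

print("before:", representation_bias(s, uniform), association_bias(s, y, uniform))
print("after: ", representation_bias(s, balanced), association_bias(s, y, balanced))
```

On this toy data both statistics are non-zero under uniform weights and vanish (up to floating-point error) after reweighting. Whether balancing the data in this way carries over to the trained model is the question the paper examines.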
Related papers
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- FairCLIP: Social Bias Elimination Based On Attribute Prototype Learning And Representation Neutralization (2022)
- CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling (2024)
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025)
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)