They're All Doctors: Synthesizing Diverse Counterfactuals To Mitigate Associative Bias
2024 · Salma Abdel Magid, Jui-Hsien Wang, Kushal Kafle, et al.
Abstract
Vision Language Models (VLMs) such as CLIP are powerful; however, they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image and text-to-video retrieval, reverse search, or classification. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle human appearance from the context of an image, i.e., what makes a doctor is not correlated with the person's visual appearance, such as skin color or body type, but with the context, such as the background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, \(CF_\alpha\
Related papers
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022)
- CalibCLIP: Contextual Calibration Of Dominant Semantics For Text-driven Image Retrieval (2025)
- FairCLIP: Social Bias Elimination Based On Attribute Prototype Learning And Representation Neutralization (2022)
- Safe-CLIP: Removing NSFW Concepts From Vision-and-language Models (2023)