Disentangling Visual And Written Concepts In CLIP
2022 · Joanna Materzynska, Antonio Torralba, David Bau
Abstract
The CLIP network measures the similarity between natural text and images; in this work, we investigate the entanglement of the representation of word images and natural images in its image encoder. First, we find that the image encoder has an ability to match word images with natural images of scenes described by those words. This is consistent with previous research that suggests that the meaning and the spelling of a word might be entangled deep within the network. On the other hand, we also find that CLIP has a strong ability to match nonsense words, suggesting that processing of letters is separated from processing of their meaning. To explicitly determine whether the spelling capability of CLIP is separable, we devise a procedure for identifying representation subspaces that selectively isolate or eliminate spelling capabilities. We benchmark our methods against a range of retrieval tasks, and we also test them by measuring the appearance of text in CLIP-guided generated images. W
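The idea of a representation subspace that isolates or eliminates one capability can be illustrated with a plain orthogonal projection. The sketch below is not the authors' procedure: it assumes a fixed, hand-picked orthonormal basis (`spelling_basis`, a hypothetical name) and made-up 3-dimensional vectors in place of real CLIP embeddings. It shows only the mechanics of keeping or removing a subspace before a cosine-similarity comparison.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity, the score CLIP-style retrieval ranks by."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def isolate(x, basis):
    """Keep only the component of x inside span(basis).

    Assumes the basis vectors are orthonormal.
    """
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

def eliminate(x, basis):
    """Remove the component of x inside span(basis)."""
    inside = isolate(x, basis)
    return [xi - pi for xi, pi in zip(x, inside)]

# Toy setup (all values are illustrative assumptions, not measurements):
# axis 0 stands in for "written text", axes 1-2 for "visual content".
spelling_basis = [[1.0, 0.0, 0.0]]
word_image = [0.9, 0.3, 0.1]     # stand-in for an image of a rendered word
natural_image = [0.1, 0.6, 0.2]  # stand-in for a photo of the scene

before = cosine(word_image, natural_image)
after = cosine(eliminate(word_image, spelling_basis),
               eliminate(natural_image, spelling_basis))
print(f"similarity before: {before:.3f}, after removing subspace: {after:.3f}")
```

In the paper the subspaces are identified by a dedicated procedure rather than chosen by hand; the fixed basis here only makes the projection step concrete.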
Related papers
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- Decomposing And Interpreting Image Representations Via Text In ViTs Beyond CLIP (2024)
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026)
- Fine-tuning CLIP Text Encoders With Two-step Paraphrasing (2024)
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026)