C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap
2023 · William Theisen, Walter Scheirer
Abstract
The interplay between the image and the comment on a social media post is highly important for understanding its overall message. Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text. However, the current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language. Current CLIP training data is based on what we call "descriptive" text: text in which an image is merely described. This is rarely seen on social media, where the vast majority of text content is "commentative" in nature. The captions provide commentary and broader context related to the image, rather than describing what is in it. Current CLIP models perform poorly on retrieval tasks where image-caption pairs display a commentative relationship. Closing this gap would be beneficial for several important application areas related to social media. For instance, it would allow groups f
Related papers
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)