Half-truths Break Similarity-based Retrieval
2026 · Bora Kargi, Arnas Uselis, Seong Joon Oh
Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69
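The half-truth test described in the abstract can be sketched as a simple pairwise comparison. This is a minimal illustration, not the authors' evaluation code. It assumes that embeddings have already been computed by some dual encoder, that similarity is cosine similarity, and that a case counts as correct when the image scores the shorter correct description strictly above the extended description containing the wrong detail. The function names `cosine_similarity` and `half_truth_accuracy` are hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def half_truth_accuracy(
    image_emb: np.ndarray,
    short_emb: np.ndarray,
    half_truth_emb: np.ndarray,
) -> float:
    """Fraction of cases where the correct short description wins.

    Assumption: row i of each array belongs to the same test case.
    short_emb holds the correct shorter description; half_truth_emb holds
    the same description with an incorrect detail appended.
    """
    sim_short = cosine_similarity(image_emb, short_emb)
    sim_half_truth = cosine_similarity(image_emb, half_truth_emb)
    # A case is correct only if the shorter, fully correct text scores higher.
    return float(np.mean(sim_short > sim_half_truth))
```

Under this reading, a model that is indifferent between the two descriptions would score near 50%, so the reported 40.6% means the half-truth wins more often than not.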