SpaceVLM: Sub-space Modeling of Negation in Vision-Language Models
2025 · Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi
Abstract
Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the
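The scoring idea in the abstract (two spherical caps around A and N, then ranking images by the central direction of the region close to A and far from N) can be illustrated with a small sketch. This is not the authors' implementation. It assumes unit-normalized embeddings, uses toy 3-D vectors in place of CLIP outputs, and estimates the central direction by Monte Carlo sampling, which is a choice made here for illustration. The names `central_direction` and `score_images` and the thresholds `cos_a` and `cos_n` are hypothetical.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit sphere (last axis)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def central_direction(a, n, cos_a=0.8, cos_n=0.3,
                      num_samples=20000, noise=0.5, seed=0):
    """Estimate the mean direction of {x : x.a >= cos_a and x.n <= cos_n}.

    cos_a and cos_n are hypothetical cap thresholds (assumptions of this
    sketch): x must lie inside the cap around a and outside the cap around n.
    """
    rng = np.random.default_rng(seed)
    a, n = normalize(a), normalize(n)
    dim = a.shape[0]
    # Sample points around a; scaling by sqrt(dim) keeps the jitter norm
    # roughly equal to `noise` regardless of the embedding dimension.
    jitter = noise * rng.standard_normal((num_samples, dim)) / np.sqrt(dim)
    samples = normalize(a + jitter)
    # Keep samples that are close to A and far from N.
    keep = (samples @ a >= cos_a) & (samples @ n <= cos_n)
    if not keep.any():
        raise ValueError("no samples in the region; loosen the thresholds")
    return normalize(samples[keep].mean(axis=0))

def score_images(image_embs, a, n, **kwargs):
    """Cosine similarity of each image embedding to the central direction."""
    direction = central_direction(a, n, **kwargs)
    return normalize(image_embs) @ direction

# Toy example: A and N are correlated concepts (cosine 0.6).
a = np.array([1.0, 0.0, 0.0])
n = np.array([0.6, 0.8, 0.0])
images = np.stack([
    normalize([0.88, -0.47, 0.0]),  # resembles A, points away from N
    normalize(a + n),               # resembles both A and N
])
scores = score_images(images, a, n)
print(scores)  # the first image is expected to rank above the second
```

With real embeddings, `a`, `n`, and `image_embs` would come from the text and image encoders of a model such as CLIP, and the thresholds would need to be chosen for that embedding space.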