Towards Effective Negation Modeling In Joint Audio-text Models For Music
2026 · Yannis Vasilakis, Rachel Bittner, Johan Pauwels
Abstract
Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
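The two ingredients named in the abstract, negated-caption augmentation and a dissimilarity term added to the contrastive objective, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the rule-based `negate_caption` helper, the clipped-cosine penalty in `negation_dissimilarity`, the `weight` parameter, and the temperature value are hypothetical and are not the authors' implementation.

```python
import re
import numpy as np


def negate_caption(caption: str) -> str:
    """Toy augmentation (assumed): swap 'with' and 'without' in a caption."""
    def swap(match: re.Match) -> str:
        return "without" if match.group(0) == "with" else "with"
    return re.sub(r"\b(with|without)\b", swap, caption)


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_loss(audio: np.ndarray, text: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired audio and text embeddings."""
    logits = l2_normalize(audio) @ l2_normalize(text).T / temperature

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))  # matching pairs lie on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def negation_dissimilarity(text: np.ndarray, negated: np.ndarray) -> float:
    """Assumed penalty: mean positive cosine similarity between each caption
    embedding and the embedding of its negated counterpart."""
    cos = np.sum(l2_normalize(text) * l2_normalize(negated), axis=-1)
    return float(np.mean(np.maximum(cos, 0.0)))


def total_loss(audio, text, negated, weight: float = 1.0) -> float:
    """Contrastive loss plus the weighted dissimilarity penalty (weight is hypothetical)."""
    return contrastive_loss(audio, text) + weight * negation_dissimilarity(text, negated)
```

In this sketch the penalty is zero once a caption and its negation are orthogonal or opposed in the embedding space, so it only pushes apart pairs that are still positively aligned. Other choices, such as a margin-based term, would fit the abstract's description equally well.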
Related papers
- SpaceVLM: Sub-space Modeling Of Negation In Vision-language Models (2025)
- TNG-CLIP: Training-time Negation Data Generation For Negation Awareness Of CLIP (2025)
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025)
- The Effect Of Negation On CLIP In Medical Imaging: Limitations Of Contrastive Language-image Pretraining (2025)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- No Captions, No Problem: Captionless 3D-CLIP Alignment With Hard Negatives Via CLIP Knowledge And LLMs (2024)
- Contrastive Audio-language Learning For Music (2022)
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)