Do Audio-language Models Understand Linguistic Variations?
2024 · Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri, et al.
Abstract
Open-vocabulary audio-language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi-view contrastive learning objective, in which paraphrases of a caption are treated as different views of the same audio scene during training. Our proposed approach improves the text-to-audio retrieval performance of CLAP by 0.8%-13% across benchmarks and enhances robustness to linguistic variation.
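The multi-view idea in the abstract can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not give the exact loss, so the code below assumes a plain symmetric InfoNCE loss averaged over text views, where view 0 holds the original captions and the remaining views hold paraphrases. The function names (`info_nce`, `multi_view_loss`) and the temperature value are hypothetical, chosen for illustration only.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length; leave all-zero vectors unchanged.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _nll_diag(logits):
    # Mean cross-entropy where row i's correct class is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(audio, text, temperature=0.07):
    # Symmetric audio-to-text and text-to-audio contrastive loss
    # over cosine similarities (assumed CLAP-style objective).
    a = [_normalize(v) for v in audio]
    t = [_normalize(v) for v in text]
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_nll_diag(logits) + _nll_diag(logits_t))


def multi_view_loss(audio, text_views, temperature=0.07):
    # Assumption: each text view (captions or paraphrases) is paired
    # with the same audio batch, and per-view losses are averaged.
    losses = [info_nce(audio, view, temperature) for view in text_views]
    return sum(losses) / len(losses)


# Toy batch: two audio clips, their captions, and one paraphrase each.
audio = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.1], [0.1, 1.0]]
paraphrases = [[0.9, 0.2], [0.2, 0.9]]

aligned = multi_view_loss(audio, [captions, paraphrases])
mismatched = multi_view_loss(audio, [captions, paraphrases[::-1]])
```

In this toy setup, the loss is lower when each paraphrase embedding sits near its own audio clip than when the paraphrases are swapped between clips, which is the pressure that would push a text encoder toward paraphrase-invariant representations.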
Related papers
- CLASP: Contrastive Language-speech Pretraining For Multilingual Multimodal Information Retrieval (2024)
- Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining (2025)
- PAT: Parameter-free Audio-text Aligner To Boost Zero-shot Audio Classification (2024)
- MATS: An Audio Language Model Under Text-only Supervision (2025)
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- On The Language Encoder Of Contrastive Cross-modal Models (2023)
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023)
- CALM: Contrastive Aligned Audio-language Multirate And Multimodal Representations (2022)