PAT: Parameter-free Audio-text Aligner To Boost Zero-shot Audio Classification
2024 · Ashish Seth, Ramaneswaran Selvakumar, Sonal Kumar, et al.
Abstract
Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification performance of CLAP-like ALMs. To achieve this, we propose to improve the cross-modal interaction between audio and language modalities by enhancing the representations for both modalities using mutual feedback. Precisely, to enhance textual representations, we propose a prompt ensemble algorithm that automatically selects and combines the most relevant prompts from a datastore with a large pool of handcrafted prompts and weighs them according to their relevance to the audio. On the other hand, to enhance audio representations, we reweigh the frame-level audio features based on the enhanced textual information. Our proposed method does not require any additional modules or parameters and can be used with any existing CLAP-like AL
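The two steps named in the abstract (a relevance-weighted prompt ensemble for text, and text-guided reweighting of frame-level audio features) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact scoring or weighting rules, so the top-k selection, the softmax weighting, the temperature value, the max-over-classes frame relevance, and all function names below are assumptions made for illustration. Embeddings are plain Python lists standing in for the outputs of a CLAP-like audio and text encoder.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature=1.0):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(vectors, weights):
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]

def enhance_text(audio_emb, prompt_embs, top_k=2, temperature=0.1):
    """Assumed rule: keep the top_k prompts most similar to the audio and
    combine them with softmax weights over those similarities."""
    prompts = [l2_normalize(p) for p in prompt_embs]
    sims = [dot(audio_emb, p) for p in prompts]
    top = sorted(range(len(prompts)), key=lambda i: sims[i], reverse=True)[:top_k]
    weights = softmax([sims[i] for i in top], temperature)
    return l2_normalize(weighted_mean([prompts[i] for i in top], weights))

def classify(frame_embs, class_prompt_embs, top_k=2, temperature=0.1):
    """Zero-shot classification with no trainable parameters (hypothetical sketch)."""
    frames = [l2_normalize(f) for f in frame_embs]
    # Initial clip-level embedding: uniform average over frames.
    uniform = [1.0 / len(frames)] * len(frames)
    pooled = l2_normalize(weighted_mean(frames, uniform))
    # Step 1: one enhanced text embedding per class, guided by the audio.
    texts = {label: enhance_text(pooled, prompts, top_k, temperature)
             for label, prompts in class_prompt_embs.items()}
    # Step 2 (assumed rule): weight each frame by its best similarity
    # to any enhanced class embedding, then re-pool the audio.
    relevance = [max(dot(f, t) for t in texts.values()) for f in frames]
    frame_weights = softmax(relevance, temperature)
    audio = l2_normalize(weighted_mean(frames, frame_weights))
    scores = {label: dot(audio, t) for label, t in texts.items()}
    return max(scores, key=scores.get), scores, frame_weights

# Toy 3-d embeddings: axis 0 ~ "dog", axis 1 ~ "rain", axis 2 ~ unrelated noise.
frames = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
prompt_store = {
    "dog": [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0], [-1.0, 0.0, 0.0]],
    "rain": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, -1.0, 0.0]],
}
label, scores, frame_weights = classify(frames, prompt_store)
print(label, scores, frame_weights)
```

In this toy input the third frame is unrelated to every class, so the text-guided weights push it close to zero before the final comparison. A real system would replace the hand-written vectors with encoder outputs and would need its own choice of `top_k` and `temperature`.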
Related papers
- Do Audio-language Models Understand Linguistic Variations? (2024)
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- MATS: An Audio Language Model Under Text-only Supervision (2025)
- Multiple Consistency-guided Test-time Adaptation For Contrastive Audio-language Models With Unlabeled Audio (2024)
- DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)
- Retrieval-augmented Text-to-audio Generation (2023)
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023)
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021)