DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning
2024 · Xiquan Li, Wenxi Chen, Ziyang Ma, et al.
Abstract
While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use both the projection strategy from the encoder side and the retrieval-augmented generation strategy fr
Related papers
- RECAP: Retrieval-augmented Audio Captioning (2023) · 9.41
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024) · 13.44
- EnCLAP: Combining Neural Audio Codec And Audio-text Joint Embedding For Automated Audio Captioning (2024) · 14.03
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024) · 7.81
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024) · 2.00
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And CLAP-Refine Through LLMs (2024) · 0.00
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) · 0.00
- Retrieval-augmented Text-to-audio Generation (2023) · 0.00