RECAP: Retrieval-augmented Audio Captioning
2023 Β· Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, et al.
Abstract
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a training-free fashion, RECAP shows u
Authors
(none)
Tags
Stats
Related papers
- Drcap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)6.34
- Advancing Natural-language Based Audio Retrieval With Passt And Large Audio-caption Data Sets (2023)0.00
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024)3.58
- Retrieval-augmented Text-to-audio Generation (2023)0.00
- Audiosetcaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)13.44
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022)0.00
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)0.00
- Sound-vecaps: Improving Audio Generation With Visual Enhanced Captions (2024)7.16