Retrieval-augmented Text-to-audio Generation
2023 · Yi Yuan, Haohe Liu, Xubo Liu, et al.
Abstract
Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet
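The retrieval step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for CLAP text and audio embeddings, the function names `retrieve_top_k` and `build_condition` are hypothetical, and concatenating the retrieved text and audio features is only an assumed way of forming the extra conditioning signal.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_top_k(query_emb, db_text_embs, k=3):
    """Return indices and cosine scores of the k database captions closest to the query."""
    q = l2_normalize(query_emb)
    db = l2_normalize(db_text_embs)
    scores = db @ q                      # cosine similarity per database entry
    top_idx = np.argsort(-scores)[:k]    # highest similarity first
    return top_idx, scores[top_idx]


def build_condition(top_idx, db_text_embs, db_audio_embs):
    """Assumed conditioning: concatenate each retrieved pair's text and audio features."""
    return np.concatenate([db_text_embs[top_idx], db_audio_embs[top_idx]], axis=-1)


# Stand-in data: in a real system these would be CLAP embeddings of a
# text-audio datastore (for example, captions and clips from the training set).
rng = np.random.default_rng(0)
db_text_embs = rng.normal(size=(100, 16))
db_audio_embs = rng.normal(size=(100, 16))

# A query that is a slightly perturbed copy of caption 7, so caption 7 should rank first.
query_emb = db_text_embs[7] + 0.01 * rng.normal(size=16)

top_idx, top_scores = retrieve_top_k(query_emb, db_text_embs, k=3)
condition = build_condition(top_idx, db_text_embs, db_audio_embs)
print(top_idx, condition.shape)
```

In this sketch the `condition` array would be passed to the generator alongside the prompt embedding; how Re-AudioLDM actually injects the retrieved features is not specified in the text above.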
Related papers
- Retrieval Augmented Generation In Prompt-based Text-to-speech Synthesis With Context-aware Contrastive Language-audio Pretraining (2024)
- Audiorag+: Feedback-driven Retrieval-augmented Audio Generation With Large Audio Language Models (2025)
- RECAP: Retrieval-augmented Audio Captioning (2023)
- DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model (2023)
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024)
- Audiobox TTA-RAG: Improving Zero-shot And Few-shot Text-to-audio With Retrieval-augmented Generation (2024)