Audiobox TTA-RAG: Improving Zero-shot And Few-shot Text-to-audio With Retrieval-augmented Generation
2024 · Mu Yang, Bowen Shi, Matthew Le, et al.
Abstract
This work focuses on improving Text-To-Audio (TTA) generation in zero-shot and few-shot settings (i.e., generating unseen or uncommon audio events). Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Models, we propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text only, we augment the conditioning input with retrieved audio samples in addition to the text. Our retrieval method does not require the external database to contain labeled audio, which makes it applicable to more practical use cases. We show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero-shot and few-shot TTA performance, by large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio in the in-domain setting.
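The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes a joint text-audio embedding space in which a text query can be compared directly against embeddings of unlabeled audio clips; the abstract does not state which encoder or similarity measure the authors use, so the cosine-similarity ranking, the function name `top_k_audio`, and the toy random embeddings below are all hypothetical and stand in for a real encoder.

```python
import numpy as np


def top_k_audio(text_emb, audio_embs, k=3):
    """Rank unlabeled audio clips by cosine similarity to a text embedding.

    Hypothetical helper: assumes text and audio embeddings live in the same
    space. No labels or captions for the audio clips are needed.
    """
    # Normalize so the dot product equals cosine similarity.
    text_unit = text_emb / np.linalg.norm(text_emb)
    audio_unit = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = audio_unit @ text_unit
    # Indices of the k most similar clips, best first.
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy database: 100 random "audio" embeddings of dimension 16 (assumed values).
rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(100, 16))
# Toy query: a slightly perturbed copy of clip 42, standing in for a text embedding.
text_emb = audio_embs[42] + 0.01 * rng.normal(size=16)

idx, scores = top_k_audio(text_emb, audio_embs, k=3)
```

In a retrieval-augmented setup of this kind, the clips selected by `idx` would be passed to the generator alongside the text prompt as additional conditioning.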
Related papers
- Audiorag+: Feedback-driven Retrieval-augmented Audio Generation With Large Audio Language Models (2025)
- Retrieval-augmented Text-to-audio Generation (2023)
- Wavrag: Audio-integrated Retrieval Augmented Generation For Spoken Dialogue Models (2025)
- Audiogen: Textually Guided Audio Generation (2022)
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024)
- Audio-agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024)
- Retrieval Augmented Generation In Prompt-based Text-to-speech Synthesis With Context-aware Contrastive Language-audio Pretraining (2024)
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024)