Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model
2023 · Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, et al.
Abstract
The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt an instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as T5. Consequently, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmenta
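The pressure-level-based sound mixing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the weighting follows the common between-class-style recipe in which the louder clip receives the smaller weight, the pressure level is approximated by the RMS level in decibels, and the function names `rms_level_db`, `mixing_weight`, and `pressure_level_mix` are hypothetical.

```python
import math


def rms_level_db(samples, eps=1e-12):
    """Approximate a clip's pressure level by its RMS level in dB (assumption)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # 10*log10(mean square) equals 20*log10(RMS); eps avoids log10(0) on silence.
    return 10.0 * math.log10(mean_square + eps)


def mixing_weight(g1_db, g2_db):
    """Relative weight p for the first clip; the louder clip gets the smaller weight."""
    return 1.0 / (1.0 + 10.0 ** ((g1_db - g2_db) / 20.0))


def pressure_level_mix(x1, x2):
    """Mix two equal-length waveforms, weighting them by relative pressure level."""
    p = mixing_weight(rms_level_db(x1), rms_level_db(x2))
    # Normalise so the weighted sum of two uncorrelated clips keeps a similar energy.
    norm = math.sqrt(p * p + (1.0 - p) ** 2)
    return [(p * a + (1.0 - p) * b) / norm for a, b in zip(x1, x2)]


# Two toy "clips" with equal loudness, so each contributes with weight 0.5.
clip_a = [0.5, -0.5, 0.5, -0.5]
clip_b = [0.5, 0.5, -0.5, -0.5]
mixed = pressure_level_mix(clip_a, clip_b)
```

In an augmentation setting, a mixed waveform like this would presumably be paired with a caption combining both source descriptions; that pairing step is not shown here and is not described in the text above.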
Related papers
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024)
- TangoFlux: Super Fast And Faithful Text To Audio Generation With Flow Matching And CLAP-ranked Preference Optimization (2024)
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025)
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024)
- Retrieval-augmented Text-to-audio Generation (2023)
- Diffusion Based Text-to-music Generation With Global And Local Text Based Conditioning (2025)
- ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation (2023)