Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation
2024 · Jinlong Xue, Yayue Deng, Yingming Gao, et al.
Abstract
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, while similar evaluation is lack
Related papers
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025) 0.00
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00
- Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model (2023) 0.00
- EzAudio: Enhancing Text-to-audio Generation With Efficient Diffusion Transformer (2024) 7.50
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024) 5.24
- AADiff: Audio-aligned Video Synthesis With Text-to-image Diffusion (2023) 0.00
- ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation (2023) 6.77
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026) 0.00