EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
2024 · Jiarui Hai, Yong Xu, Hao Zhang, et al.
Abstract
We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
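The abstract does not spell out the CFG rescaling formula, so the following is only a minimal sketch of the commonly used variant, in which the guided prediction is rescaled to match the standard deviation of the conditional prediction and then blended with the plain guided output. The function name `cfg_with_rescale`, the argument names, and the default values of `guidance_scale` and `rescale` are assumptions for illustration, not taken from EzAudio.

```python
import numpy as np

def cfg_with_rescale(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Classifier-free guidance followed by std-matching rescaling.

    cond, uncond: model predictions with and without the text prompt,
    arrays of identical shape (batch, ...). Names and defaults here are
    hypothetical, chosen only for this sketch.
    """
    # Standard CFG: extrapolate from the unconditional prediction
    # toward the conditional one.
    guided = uncond + guidance_scale * (cond - uncond)

    # Per-sample standard deviation over all non-batch axes.
    axes = tuple(range(1, cond.ndim))
    std_cond = cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)

    # Shrink the guided prediction so its std matches the conditional one,
    # counteracting the over-amplification seen at large guidance scales.
    rescaled = guided * (std_cond / (std_guided + 1e-8))

    # Blend the rescaled and plain guided predictions.
    return rescale * rescaled + (1.0 - rescale) * guided
```

With `rescale=0.0` this reduces to plain CFG, and with `rescale=1.0` the output has the same per-sample standard deviation as the conditional prediction; intermediate values trade off between the two.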
Related papers
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024)
- Controlaudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025)
- Dreamaudio: Customized Text-to-audio Generation With Diffusion Models (2026)
- ETTA: Elucidating The Design Space Of Text-to-audio Models (2024)
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025)
- Degdit: Controllable Audio Generation With Dynamic Event Graph Guided Diffusion Transformer (2025)
- Aadiff: Audio-aligned Video Synthesis With Text-to-image Diffusion (2023)
- Audiotoken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023)