ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation
2023 · Yatong Bai, Trung Dang, Dung Tran, et al.
Abstract
Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference because each generation requires a large number of queries to the underlying denoising network. To address this bottleneck, we introduce ConsistencyTTA, a framework that requires only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation to a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be fine-tuned in a closed loop with audio-space text-aware metrics, such as the CLAP score, to further enhance the generations. Our objective and subjective evaluations on the AudioCaps dataset show that, compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
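To make the query-count argument concrete, here is a minimal toy sketch, not the authors' implementation. It assumes the common CFG formulation `uncond + w * (cond - uncond)`, a hypothetical 200-step diffusion sampler (a figure chosen only so that the arithmetic lines up with the 400x in the abstract), and a hypothetical `CountingDenoiser` that stands in for the real network. The update rule inside `diffusion_sample` is a placeholder, not a real scheduler.

```python
from typing import List, Optional

Vector = List[float]


def cfg_combine(cond: Vector, uncond: Vector, w: float) -> Vector:
    """Classifier-free guidance: move from the unconditional output
    toward the conditional one, scaled by guidance weight w."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


class CountingDenoiser:
    """Hypothetical stand-in for a denoising network; counts its queries."""

    def __init__(self) -> None:
        self.queries = 0

    def __call__(self, z: Vector, t: int, text: Optional[str],
                 w: Optional[float] = None) -> Vector:
        self.queries += 1
        shift = 0.0 if text is None else 1.0  # toy "text conditioning"
        return [0.1 * x + shift for x in z]


def diffusion_sample(net: CountingDenoiser, z: Vector, text: str,
                     steps: int, w: float) -> Vector:
    """Toy iterative sampler: two network queries per step because of CFG."""
    for t in range(steps, 0, -1):
        cond = net(z, t, text)    # text-conditioned query
        uncond = net(z, t, None)  # unconditional query
        eps = cfg_combine(cond, uncond, w)
        z = [x - e / steps for x, e in zip(z, eps)]  # placeholder update
    return z


def consistency_sample(net: CountingDenoiser, z: Vector, text: str,
                       w: float) -> Vector:
    """Toy CFG-aware single-step sampler: the guidance weight is an input
    to the network, so one query replaces the whole loop."""
    return net(z, 1, text, w)


diffusion_net = CountingDenoiser()
diffusion_sample(diffusion_net, [0.0, 1.0], "a dog barking", steps=200, w=3.0)

consistency_net = CountingDenoiser()
consistency_sample(consistency_net, [0.0, 1.0], "a dog barking", w=3.0)

print(diffusion_net.queries, consistency_net.queries)
```

Under these assumptions the iterative sampler issues two queries per step, while the single-step sampler issues one in total. Feeding the guidance weight to the network as an input is how the sketch models "CFG-aware" training; the actual conditioning mechanism is described in the paper itself.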
Related papers
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025)
- Multi-GradSpeech: Towards Diffusion-based Multi-speaker Text-to-speech Using Consistent Diffusion Models (2023)
- DCTTS: Discrete Diffusion Model With Contrastive Learning For Text-to-speech Generation (2023)
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024)
- CM-TTS: Enhancing Real Time Text-to-speech Synthesis Efficiency Through Weighted Samplers And Consistency Models (2024)
- Fast Text-to-audio Generation With One-step Sampling Via Energy-scoring And Auxiliary Contextual Representation Distillation (2026)
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023)
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026)