ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling
2025 · Yuxuan Jiang, Zehua Chen, Zeqian Ju, et al.
Abstract
Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, exp
Related papers
- ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation (2023) 6.77
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026) 0.00
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024) 5.24
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024) 11.19
- EzAudio: Enhancing Text-to-audio Generation With Efficient Diffusion Transformer (2024) 7.50
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00
- DegDiT: Controllable Audio Generation With Dynamic Event Graph Guided Diffusion Transformer (2025) 0.00
- Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model (2023) 0.00