Fast Text-to-audio Generation With One-step Sampling Via Energy-scoring And Auxiliary Contextual Representation Distillation
2026 · Kuan-Po Huang, Bo-Ru Lu, Byeonggeun Kim, et al.
Abstract
arXiv:2605.00329v1. Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling introduce high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared t
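The abstract does not spell out the training objective, so the following is only a minimal sketch of what an energy-distance loss for a one-step sampler can look like. It assumes the standard two-sample estimator of the energy score: two outputs are drawn from the same context with independent Gaussian noise, each is pulled toward the target latent, and the pair is pushed apart. The names `toy_head`, `energy_score_loss`, the linear map, and all dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def distance(u, v, beta=1.0):
    """Euclidean distance between two vectors, raised to the power beta."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))) ** beta


def energy_score_loss(sample_a, sample_b, target, beta=1.0):
    """Two-sample estimate of the energy score (lower is better).

    The first term rewards samples that land near the target; the second
    term rewards the two samples for differing, which discourages the
    head from collapsing to a single output per context.
    """
    fidelity = 0.5 * (distance(sample_a, target, beta)
                      + distance(sample_b, target, beta))
    diversity = 0.5 * distance(sample_a, sample_b, beta)
    return fidelity - diversity


def toy_head(context, noise, weights):
    """Hypothetical one-step head: one linear map applied to [context, noise]."""
    features = context + noise  # list concatenation
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


rng = random.Random(0)
ctx_dim, noise_dim, latent_dim = 4, 8, 8  # illustrative sizes only

context = [rng.gauss(0, 1) for _ in range(ctx_dim)]
target = [rng.gauss(0, 1) for _ in range(latent_dim)]
weights = [[rng.gauss(0, 0.1) for _ in range(ctx_dim + noise_dim)]
           for _ in range(latent_dim)]


def draw_noise():
    """Fresh Gaussian noise vector for one sample."""
    return [rng.gauss(0, 1) for _ in range(noise_dim)]


# One forward pass per sample: noise goes in, a latent comes out,
# with no iterative denoising loop.
sample_a = toy_head(context, draw_noise(), weights)
sample_b = toy_head(context, draw_noise(), weights)
loss = energy_score_loss(sample_a, sample_b, target)
```

With `beta=1.0` the triangle inequality keeps this estimate non-negative, and it is exactly zero when both samples equal the target. In a real system the head would be a neural network trained by gradient descent on this loss over batches of latents.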
Related papers
- FlashAudio: Rectified Flows For Fast And High-fidelity Text-to-audio Generation (2024) 5.13
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025) 0.00
- ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation (2023) 6.77
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026) 0.00
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024) 5.84
- AudioToken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023) 9.76
- Score Distillation Sampling For Audio: Source Separation, Synthesis, And Beyond (2025) 0.00
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00