Fast Text-to-audio Generation With One-step Sampling Via Energy-scoring And Auxiliary Contextual Representation Distillation

Abstract

arXiv:2605.00329v1

Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, but their iterative decoding and multi-step sampling incur high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared t
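The abstract does not give the exact form of the energy-distance objective, so the following is only a minimal sketch of the standard two-sample energy score from the scoring-rules literature, which is one plausible reading of it. The names `energy_score_loss` and `toy_one_step_head`, the exponent `beta`, and the toy linear head are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def distance(u, v, beta=1.0):
    """Euclidean distance raised to the power beta (0 < beta < 2)."""
    return math.dist(u, v) ** beta


def energy_score_loss(sample_a, sample_b, target, beta=1.0):
    """Two-sample estimate of the energy score (lower is better).

    sample_a and sample_b are two independent one-step draws for the same
    condition; target is the reference latent.
    """
    # Fidelity term: pulls both draws toward the reference latent.
    fidelity = 0.5 * (distance(sample_a, target, beta)
                      + distance(sample_b, target, beta))
    # Diversity term: rewards the two draws for differing, which
    # discourages collapse to a single deterministic output.
    diversity = 0.5 * distance(sample_a, sample_b, beta)
    return fidelity - diversity


def toy_one_step_head(noise, context):
    """Hypothetical stand-in for a one-step head: noise -> latent in one pass."""
    return [c + 0.1 * n for n, c in zip(noise, context)]


rng = random.Random(0)
context = [0.5, -1.0, 2.0]   # assumed conditioning vector
target = [0.4, -0.9, 2.1]    # assumed ground-truth latent

# Two independent Gaussian noise draws, each mapped to a latent in one step.
draw_a = toy_one_step_head([rng.gauss(0, 1) for _ in context], context)
draw_b = toy_one_step_head([rng.gauss(0, 1) for _ in context], context)

loss = energy_score_loss(draw_a, draw_b, target)
print(loss)
```

With `beta = 1` the estimate is non-negative by the triangle inequality and is zero when both draws coincide with the target. In a real system the head would be a trained network and the loss would be averaged over a batch.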
