EDMSound: Spectrogram Based Diffusion Models For Efficient And High-quality Audio Synthesis
2023 · Ge Zhu, Yutong Wen, Marc-André Carbonneau, et al.
Abstract
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain and rely on cascaded phase recovery modules to reconstruct the waveform, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, it achieves a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
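For readers unfamiliar with the FAD metric named in the abstract, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to sets of embeddings, which is the quantity FAD reports. It is not the authors' evaluation code. The function name `frechet_distance` and the random placeholder arrays are assumptions made for illustration; in practice the embeddings would come from a pretrained audio model applied to reference and generated clips.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, embedding_dim).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Placeholder data standing in for real audio embeddings (assumption).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
shifted = reference + 0.5  # same covariance, every mean moved by 0.5

same_score = frechet_distance(reference, reference)
shifted_score = frechet_distance(reference, shifted)
```

Comparing a set with itself should give a score near zero, while the shifted set differs only in its mean, so its score should be close to the squared mean offset, 8 × 0.5² = 2.0. Lower scores indicate that generated audio is statistically closer to the reference set.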
Related papers
- Audio Generation Through Score-based Generative Modeling: Design Principles And Implementation (2025) · 1.91
- Diff-SAGE: End-to-end Spatial Audio Generation Using Diffusion Models (2024) · 6.34
- DiffAR: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation (2023) · 0.00
- DiffWave: A Versatile Diffusion Model For Audio Synthesis (2020) · 0.00
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) · 14.35
- Score Distillation Sampling For Audio: Source Separation, Synthesis, And Beyond (2025) · 0.00
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model (2024) · 0.00
- BDDM: Bilateral Denoising Diffusion Models For Fast And High-quality Speech Synthesis (2022) · 4.76