Diff-SAGe: End-to-end Spatial Audio Generation Using Diffusion Models
2024 · Saksham Singh Kushwaha, Jianbo Ma, Mark R. P. Thomas, et al.
Abstract
Spatial audio is a crucial component in creating immersive experiences. Traditional simulation-based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task of generating first-order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff-SAGe, an end-to-end, flow-based diffusion-transformer model for this task. Diff-SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi-conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we demonstrate that our method consistently outperforms traditional simulation-based baselines across both obje
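To make the representation concrete, here is a minimal sketch of how a mono source at a given direction could be panned to four FOA channels and then turned into a complex spectrogram that keeps phase. It is not the Diff-SAGe pipeline: the function names, the ACN/SN3D channel convention, the STFT parameters, and the choice to stack real and imaginary parts as separate channels are all assumptions made for illustration.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to FOA.

    Assumes ACN channel order (W, Y, Z, X) with SN3D gains; the abstract
    does not state which convention the paper uses.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    return gains[:, None] * mono[None, :]  # shape (4, num_samples)

def complex_spectrogram(foa, n_fft=512, hop=128):
    """STFT each channel and stack real and imaginary parts.

    n_fft and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (foa.shape[1] - n_fft) // hop
    frames = np.stack(
        [foa[:, i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )                                      # (4, n_frames, n_fft)
    spec = np.fft.rfft(frames, axis=-1)    # (4, n_frames, n_fft // 2 + 1)
    # Keeping both parts preserves the inter-channel phase that carries
    # the spatial cues; a magnitude-only spectrogram would discard it.
    return np.concatenate([spec.real, spec.imag], axis=0)

# Usage: a 440 Hz tone placed directly to the left (azimuth 90 degrees).
sr = 16000
t = np.arange(sr) / sr
mono = np.sin(2 * np.pi * 440 * t)
foa = encode_foa(mono, azimuth_deg=90, elevation_deg=0)
features = complex_spectrogram(foa)
```

With the source at azimuth 90 degrees and elevation 0, the Y channel equals the mono signal while X and Z are essentially zero, which is the kind of direction-dependent inter-channel relationship a generative model for this task would need to reproduce.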
Related papers
- Immersediffusion: A Generative Spatial Audio Latent Diffusion Model (2024)
- Edmsound: Spectrogram Based Diffusion Models For Efficient And High-quality Audio Synthesis (2023)
- Audio Generation Through Score-based Generative Modeling: Design Principles And Implementation (2025)
- Score Distillation Sampling For Audio: Source Separation, Synthesis, And Beyond (2025)
- Diff-foley: Synchronized Video-to-audio Synthesis With Latent Diffusion Models (2023)
- Diffar: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation (2023)
- Investigating The Design Space Of Diffusion Models For Speech Enhancement (2023)
- Audiomog: Guiding Audio Generation With Mixture-of-guidance (2025)