Diffusion-based Unsupervised Audio-visual Speech Enhancement
2024 · Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, et al.
Abstract
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech conditioned on corresponding video data to simulate the speech generative distribution. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process, where after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method. Additionally, the new inference algorithm offers a better balance between inference speed and performance compared to the previous di
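The noise-update step described in the abstract, where each intermediate speech estimate is used to refresh the NMF noise parameters, can be illustrated with a minimal sketch. The abstract does not give the update rule, so this assumes the Itakura-Saito multiplicative updates commonly used for NMF variance models, fitted to the residual power spectrogram. The function name `update_nmf_noise` and the variables `W`, `H`, `noisy_stft`, and `speech_est` are hypothetical and not taken from the paper.

```python
import numpy as np

def update_nmf_noise(noisy_stft, speech_est, W, H, n_iter=1, eps=1e-8):
    """Refit an NMF noise variance model V = W @ H to the residual power.

    Assumption: Itakura-Saito multiplicative updates; the paper's exact
    rule may differ.
    """
    # Power of what is left after removing the current speech estimate.
    P = np.abs(noisy_stft - speech_est) ** 2 + eps
    for _ in range(n_iter):
        V = W @ H + eps
        W = W * ((P / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)
        V = W @ H + eps
        H = H * (W.T @ (P / V**2)) / (W.T @ (1.0 / V) + eps)
    return W, H

# Toy usage with random data standing in for STFT frames.
rng = np.random.default_rng(0)
F, T, K = 8, 20, 3  # frequency bins, time frames, NMF rank
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
S = 0.5 * X  # placeholder for the speech estimate at one reverse step
W0 = rng.random((F, K)) + 0.1
H0 = rng.random((K, T)) + 0.1
W1, H1 = update_nmf_noise(X, S, W0, H0, n_iter=5)
```

In an iterative scheme of the kind the abstract outlines, a call like this would sit inside the reverse diffusion loop, after the step that produces the speech estimate.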
Related papers
- AV2Wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023)
- Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders (2019)
- Audio-visual Speech Enhancement With A Deep Kalman Filter Generative Model (2022)
- Noise-aware Speech Enhancement Using Diffusion Probabilistic Model (2023)
- Diffusion-based Speech Enhancement With A Weighted Generative-supervised Learning Loss (2023)
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model (2023)
- Unsupervised Speech Enhancement With Deep Dynamical Generative Speech And Noise Models (2023)
- Deep Variational Generative Models For Audio-visual Speech Separation (2020)