Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders
2019 · Mostafa Sadeghi, Xavier Alameda-Pineda
Abstract
Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model for the noise variance to perform unsupervised speech enhancement. When the visual data is clean, speech enhancement with the audio-visual VAE performs better than with the audio-only VAE, which is trained on audio-only data. However, the audio-visual VAE is not robust against noisy visual data, e.g., when, for some video frames, the speaker's face is not frontal or the lip region is occluded. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame VAE mixture model. This mixture model consists of a trained audio-only VAE and a trained audio-visual VAE. The motivation is to skip noisy visual frames by switching to the audio-only VAE model. We present a variational expectation-maximization method to estimate the parameters of the model. Experiments show the promising performance
Related papers
- Mixture Of Inference Networks For VAE-based Audio-visual Speech Enhancement (2019)
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- Switching Variational Auto-encoders For Noise-agnostic Audio-visual Speech Enhancement (2021)
- Deep Variational Generative Models For Audio-visual Speech Separation (2020)
- Audio-visual Speech Enhancement With A Deep Kalman Filter Generative Model (2022)
- Statistical Speech Enhancement Based On Probabilistic Integration Of Variational Autoencoder And Non-negative Matrix Factorization (2017)
- Diffusion-based Unsupervised Audio-visual Speech Enhancement (2024)
- A Statistically Principled And Computationally Efficient Approach To Speech Enhancement Using Variational Autoencoders (2019)