Mixture Of Inference Networks For VAE-based Audio-visual Speech Enhancement
2019 · Mostafa Sadeghi, Xavier Alameda-Pineda
Abstract
In this paper, we are interested in unsupervised (unknown noise) audio-visual speech enhancement based on variational autoencoders (VAEs), where the probability distribution of clean speech spectra is modeled using an encoder-decoder architecture. The trained generative model (the decoder) is then combined with a noise model at test time to estimate the clean speech. In the speech enhancement phase (test time), the initialization of the latent variables, which describe the generative process of clean speech via the decoder, is crucial because the overall inference problem is non-convex. This initialization is usually taken from the output of the trained encoder, with the noisy audio and clean visual data given as input. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, inspired by mixture models, we introduce the mixture of inference networks variational autoenc
Related papers
- Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders (2019)
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- Switching Variational Auto-encoders For Noise-agnostic Audio-visual Speech Enhancement (2021)
- A Statistically Principled And Computationally Efficient Approach To Speech Enhancement Using Variational Autoencoders (2019)
- Deep Variational Generative Models For Audio-visual Speech Separation (2020)
- Statistical Speech Enhancement Based On Probabilistic Integration Of Variational Autoencoder And Non-negative Matrix Factorization (2017)
- Audio-visual Speech Enhancement With A Deep Kalman Filter Generative Model (2022)
- Investigation Of Speech And Noise Latent Representations In Single-channel VAE-based Speech Enhancement (2025)