Deep Variational Generative Models For Audio-visual Speech Separation
2020 · Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, et al.
Abstract
In this paper, we address audio-visual speech separation given a single-channel audio recording together with visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as from the visual data. The visual modality also serves as a prior for the latent variables, through a visual network. At test time, the learned generative model (for both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-m
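As a rough illustration of how a per-speaker speech variance model and an NMF noise variance model can be combined at test time, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the array sizes, the random stand-in for the VAE decoder outputs (`speech_var`), the NMF factors `W` and `H`, and the helper `wiener_masks`. The abstract is cut off before it describes the final estimator, so the Wiener-like masking step shown here is a common choice in VAE-plus-NMF models, not something the text above confirms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): frequency bins, time frames, NMF rank, speakers.
F, T, K, S = 6, 5, 2, 2

# Stand-in for the VAE decoder: one nonnegative speech variance map per speaker.
# In a real system these would come from a trained decoder, not random numbers.
speech_var = rng.uniform(0.1, 1.0, size=(S, F, T))

# NMF variance model for the noise: variance at (f, t) is (W @ H)[f, t].
W = rng.uniform(0.1, 1.0, size=(F, K))  # spectral templates (hypothetical values)
H = rng.uniform(0.1, 1.0, size=(K, T))  # temporal activations (hypothetical values)
noise_var = W @ H


def wiener_masks(speech_var, noise_var):
    """Return per-speaker masks and a noise mask that sum to one in every bin."""
    total = speech_var.sum(axis=0) + noise_var
    return speech_var / total, noise_var / total


speech_masks, noise_mask = wiener_masks(speech_var, noise_var)

# Toy complex-valued mixture spectrogram (assumed data).
mixture = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

# Apply each speaker's mask to the mixture; broadcasting gives shape (S, F, T).
estimates = speech_masks * mixture
noise_estimate = noise_mask * mixture
```

Because the masks sum to one in every time-frequency bin, the speaker estimates and the noise estimate add back up to the mixture. In the setting the abstract describes, the variances would be refined iteratively rather than fixed as they are here.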
Related papers
- Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders (2019)
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- Mixture Of Inference Networks For Vae-based Audio-visual Speech Enhancement (2019)
- Audio-visual Speech Enhancement With A Deep Kalman Filter Generative Model (2022)
- A Multimodal Dynamical Variational Autoencoder For Audiovisual Speech Representation Learning (2023)
- Switching Variational Auto-encoders For Noise-agnostic Audio-visual Speech Enhancement (2021)
- Generalized Multichannel Variational Autoencoder For Underdetermined Source Separation (2018)
- Time Domain Audio Visual Speech Separation (2019)