La-voce: Low-snr Audio-visual Speech Enhancement Using Neural Vocoders
2022 Β· Rodrigo Mira, Buye Xu, Jacob Donley, et al.
Abstract
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that
Authors
(none)
Tags
Stats
Related papers
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022)15.58
- Vocgan: A High-fidelity Real-time Vocoder With A Hierarchically-nested Adversarial Network (2020)12.54
- VSEGAN: Visual Speech Enhancement Generative Adversarial Network (2021)8.60
- Vocos: Closing The Gap Between Time-domain And Fourier-based Neural Vocoders For High-quality Audio Synthesis (2023)6.10
- Bigvgan: A Universal Neural Vocoder With Large-scale Training (2022)6.17
- Vnet: A Gan-based Multi-tier Discriminator Network For Speech Synthesis Vocoders (2024)2.26
- Univnet: A Neural Vocoder With Multi-resolution Spectrogram Discriminators For High-fidelity Waveform Generation (2021)14.80
- Lip2vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023)6.77