AV2Wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement
2023 · Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu
Abstract
Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-
Related papers
- Diffusion-based Unsupervised Audio-visual Speech Enhancement (2024) · 4.52
- Multichannel AV-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation (2024) · 7.16
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022) · 15.58
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022) · 8.60
- FlowAVSE: Efficient Audio-visual Speech Enhancement With Conditional Flow Matching (2024) · 0.00
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022) · 11.19
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019) · 13.65
- ReVISE: Self-supervised Speech Resynthesis With Visual Input For Universal And Generalized Speech Enhancement (2022) · 0.00