RobustL2S: Speaker-specific Lip-to-speech Synthesis Exploiting Self-supervised Representations
2023 · Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, et al.
Abstract
Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that direct mel-spectrogram prediction hampers training and model efficiency because speech content is entangled with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Spee
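The two-stage layout described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frame rates, feature sizes, random weights, and the function names `lip_to_content` and `content_to_waveform` are all assumptions made for illustration, and single linear projections stand in for the sequence-to-sequence model and the vocoder.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
VIDEO_FPS = 25                        # assumed video frame rate
CONTENT_HZ = 50                       # assumed rate of the speech-content features
SAMPLE_RATE = 16_000                  # assumed audio sample rate
UPSAMPLE = CONTENT_HZ // VIDEO_FPS    # content frames per video frame (2)
HOP = SAMPLE_RATE // CONTENT_HZ       # audio samples per content frame (320)


def lip_to_content(visual_feats, weight):
    """Stage 1 stand-in: map every visual frame to content frames in one pass."""
    # Repeat each video frame so the sequence matches the content frame rate.
    upsampled = np.repeat(visual_feats, UPSAMPLE, axis=0)
    # One projection over the whole sequence: each output frame is computed in
    # parallel, without conditioning on previously generated frames.
    return np.tanh(upsampled @ weight)


def content_to_waveform(content_feats, weight):
    """Stage 2 stand-in: a toy 'vocoder' emitting HOP samples per content frame."""
    frames = np.tanh(content_feats @ weight)   # shape (T_content, HOP)
    return frames.reshape(-1)                  # shape (T_content * HOP,)


rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((75, 512))          # 3 s of video at 25 fps
w_content = rng.standard_normal((512, 256)) * 0.01     # hypothetical stage-1 weights
w_vocoder = rng.standard_normal((256, HOP)) * 0.01     # hypothetical stage-2 weights

content = lip_to_content(visual_feats, w_content)
waveform = content_to_waveform(content, w_vocoder)
print(content.shape, waveform.shape)
```

Under these assumed rates, 75 video frames become 150 content frames and 48,000 audio samples, i.e. 3 seconds at 16 kHz. The point of the split is that stage 1 only has to predict a compact content representation, while waveform detail is left to the second stage.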
Related papers
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) · 11.39
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differentiable Digital Signal Processing (2025) · 0.00
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) · 3.89
- Lip2Vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023) · 6.77
- Rep2wav: Noise Robust Text-to-speech Using Self-supervised Representations (2023) · 0.00
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020) · 0.00
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) · 6.34
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024) · 0.00