LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading
2023 · Yochai Yemini, Aviv Shamsian, Lior Bracha, et al.
Abstract
Lip-to-speech involves generating natural-sounding speech synchronized with a soundless video of a person talking. Despite recent advances, current methods still cannot produce high-quality, highly intelligible speech for challenging and realistic datasets such as LRS3. In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism in which a pre-trained ASR model serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary. Moreover, our experiments show that the inclusion of the text modality plays a major role in the int
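The classifier-guidance step mentioned in the abstract can be illustrated with a minimal sketch. It uses the general classifier-guidance rule for diffusion models, in which the predicted noise is shifted by the gradient of the classifier's log-probability; the abstract does not give the exact rule or scales LipVoicer uses. The function names, the toy quadratic "classifier", and all numeric values are assumptions for illustration only. In the real system the gradient would come from the pre-trained ASR evaluated on the text predicted by the lip reader.

```python
import numpy as np

def guided_noise(eps_pred, grad_log_p, alpha_bar_t, guidance_scale):
    """Generic classifier-guided noise estimate (not the paper's exact rule).

    eps_pred:       noise predicted by the video-conditioned diffusion model
    grad_log_p:     gradient of log p(text | x_t) with respect to x_t
    alpha_bar_t:    cumulative noise-schedule product at step t, in (0, 1)
    guidance_scale: weight on the classifier term (0 disables guidance)
    """
    return eps_pred - guidance_scale * np.sqrt(1.0 - alpha_bar_t) * grad_log_p

def toy_grad_log_p(x_t, target):
    # Hypothetical stand-in for the ASR gradient:
    # log p = -0.5 * ||x_t - target||^2, so the gradient is (target - x_t).
    return target - x_t

x_t = np.zeros(4)            # toy noisy sample (would be a mel-spectrogram)
target = np.ones(4)          # toy point the "classifier" prefers
eps_pred = np.full(4, 0.5)   # toy output of the diffusion model

eps_guided = guided_noise(
    eps_pred, toy_grad_log_p(x_t, target), alpha_bar_t=0.75, guidance_scale=2.0
)
eps_unguided = guided_noise(
    eps_pred, toy_grad_log_p(x_t, target), alpha_bar_t=0.75, guidance_scale=0.0
)
```

With a guidance scale of zero the estimate reduces to the diffusion model's own prediction; larger scales shift each denoising step further toward samples the classifier assigns to the target text.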
Related papers
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019) · 10.61
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) · 6.34
- Lip2Vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023) · 6.77
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) · 11.39
- VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over (2021) · 9.41
- LipGER: Visually-conditioned Generative Error Correction For Robust Automatic Speech Recognition (2024) · 2.26
- RobustL2S: Speaker-specific Lip-to-speech Synthesis Exploiting Self-supervised Representations (2023) · 4.52
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differential Digital Signal Processing (2025) · 0.00