Let There Be Sound: Reconstructing High Quality Speech From Silent Videos
2023 · Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
Abstract
The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better address this problem, we employ a flow-based post-net that captures and refines the details of the generated speech. We perform extensive experiments on two datasets and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existin
Related papers
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) · 3.89
- FluentLip: A Phonemes-based Two-stage Approach For Audio-driven Lip Synthesis With Optical Flow Consistency (2025) · 0.00
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025) · 0.00
- Improved Speech Reconstruction From Silent Video (2017) · 13.34
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) · 11.39
- Speech Reconstruction From Silent Tongue And Lip Articulation By Pseudo Target Generation And Domain Adversarial Training (2023) · 5.84
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019) · 10.61
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differential Digital Signal Processing (2025) · 0.00