From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech
2025 · Ji-Hoon Kim, Jeongsoo Choi, Jaehun Kim, et al.
Abstract
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with their corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct tr
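The abstract is cut off before it describes the flow matching model, so the paper's exact formulation is not visible here. As a rough illustration only, the sketch below shows a generic straight-path conditional flow matching objective and an Euler sampler. All names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `predict_velocity`, `cond`) are hypothetical, and the straight-line path is an assumption, not necessarily the authors' choice.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity(x, t, cond) is a hypothetical model; cond stands in
    for conditioning features such as the visual representations.
    """
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t, cond)
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(predict_velocity, x0, cond, num_steps=8):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * predict_velocity(x, t, cond)
    return x
```

In a real system `predict_velocity` would be a trained network and `x1` an acoustic target such as a mel-spectrogram; here they are left abstract so the sketch stays independent of any particular framework.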
Related papers
- VisageSynTalk: Unseen Speaker Video-to-speech Synthesis Via Speech-visage Feature Selection (2022) · 5.24
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) · 6.34
- See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement (2025) · 0.00
- Learning To Dub Movies Via Hierarchical Prosody Models (2022) · 10.97
- TransFace: Unit-based Audio-visual Speech Synthesizer For Talking Head Translation (2023) · 7.16
- Improved Speech Reconstruction From Silent Video (2017) · 13.34
- Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis From Face Images With Improved Face-to-speech Mapping (2023) · 2.26
- From Inference To Generation: End-to-end Fully Self-supervised Generation Of Human Face From Speech (2020) · 0.00