VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
2022 · Joanna Hong, Minsu Kim, Yong Man Ro
Abstract
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance in synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge for video-to-speech synthesis, and this becomes more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to focus on modeling the two representations independently, we can obtain highly intelligible speech from the model even when the input video shows an unseen subject. To this end, we introduce speech-visage selection, which separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer which gener
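The abstract does not say how the selection is computed, so the following is only a minimal sketch of the general idea of splitting one set of visual features into a content part and an identity part. It assumes a per-channel soft gate and a time-averaged style vector; the names `select_features` and `gate_logits` are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_features(features, gate_logits):
    """Split per-frame visual features into content and style parts.

    features:    list of frames, each a list of floats (assumed visual features).
    gate_logits: one logit per channel; its sigmoid is the share of that
                 channel routed to speech content (an assumption of this sketch).
    Returns (content, style): content keeps one vector per frame, style is a
    single vector averaged over time, assuming identity is constant in a clip.
    """
    gates = [sigmoid(g) for g in gate_logits]
    content = [[g * v for g, v in zip(gates, frame)] for frame in features]
    residual = [[(1.0 - g) * v for g, v in zip(gates, frame)] for frame in features]
    n_frames = len(features)
    style = [sum(channel) / n_frames for channel in zip(*residual)]
    return content, style


# Toy input: 3 frames with 4 feature channels each.
random.seed(0)
frames = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
logits = [2.0, -2.0, 0.0, 0.5]
content, style = select_features(frames, logits)
```

In a trained model the gate would be learned and the two outputs would feed a synthesizer; here the gate values are fixed by hand purely to show the complementary split.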
Related papers
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025)
- Diffv2s: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding (2023)
- Divise: Direct Visual-input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025)
- Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control (2023)
- Facespeak: Expressive And High-quality Speech Synthesis From Human Portraits Of Different Styles (2025)
- Seeing Your Speech Style: A Novel Zero-shot Identity-disentanglement Face-based Voice Conversion (2024)
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)
- See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement (2025)