Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis From Face Images With Improved Face-to-speech Mapping
2023 · Minki Kang, Wooseok Han, Eunho Yang
Abstract
Generating speech from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech conditioned on a face image rather than reference speech. We hypothesize that learning the entire set of prosodic features from a face image alone poses a significant challenge. To address this, our TTS model incorporates both a face encoder and a prosody encoder. The prosody encoder is specifically designed to model speech style characteristics that are not fully captured by the face image, allowing the face encoder to focus on extracting speaker-specific features such as timbre. Experimental results demonstrate that Face-StyleSpeech generates more natural speech from a face image than the baselines, even for unseen faces. Samples are available on our demo page.
Related papers
- FaceSpeak: Expressive And High-quality Speech Synthesis From Human Portraits Of Different Styles (2025)
- Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control (2023)
- Seeing Your Speech Style: A Novel Zero-shot Identity-disentanglement Face-based Voice Conversion (2024)
- StyleFusion TTS: Multimodal Style-control And Enhanced Feature Fusion For Zero-shot Text-to-speech Synthesis (2024)
- Face-driven Zero-shot Voice Conversion With Memory-based Face-voice Alignment (2023)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- ControlSpeech: Towards Simultaneous And Independent Zero-shot Speaker Cloning And Zero-shot Language Style Control (2024)
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025)