Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control
2023 · Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling
Abstract
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because the extra speaker embeddings must be extracted from natural reference speech, which is unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to improve the ability of face-based speaker embeddings (FSE) to control the voice. Extensive experiments verify the eff
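The disentanglement idea in the abstract can be illustrated with a minimal sketch of a generic variational autoencoder bottleneck. This is not the authors' architecture: the function names, vector sizes, and the choice to condition the decoder by concatenation are all assumptions made for illustration. It shows only that a content latent sampled once can be paired with different speaker embeddings, so that the speaker part alone changes.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def decoder_input(content_latent, speaker_embedding):
    """Assumed conditioning scheme: concatenate content and speaker vectors."""
    return list(content_latent) + list(speaker_embedding)


rng = random.Random(0)

# Hypothetical content posterior, standing in for a lip-video encoder output.
mu = [0.2, -0.1, 0.05]
log_var = [-1.0, -0.5, 0.0]
z_content = reparameterize(mu, log_var, rng)

# Hypothetical face-based speaker embeddings for two different faces.
face_embedding_a = [1.0, 0.0]
face_embedding_b = [0.0, 1.0]

# Same linguistic content, two different target voices.
x_a = decoder_input(z_content, face_embedding_a)
x_b = decoder_input(z_content, face_embedding_b)

kl = kl_to_standard_normal(mu, log_var)
```

In a trained model the KL term would be one part of the training loss, and the decoder would map `x_a` and `x_b` to speech features; both are outside the scope of this sketch.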
Related papers
- Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis From Face Images With Improved Face-to-speech Mapping (2023)
- Seeing Your Speech Style: A Novel Zero-shot Identity-disentanglement Face-based Voice Conversion (2024)
- Face-driven Zero-shot Voice Conversion With Memory-based Face-voice Alignment (2023)
- SEF-VC: Speaker Embedding Free Zero-shot Voice Conversion With Cross Attention (2023)
- Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis (2022)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)
- VisageSynTalk: Unseen Speaker Video-to-speech Synthesis Via Speech-visage Feature Selection (2022)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)