From Inference To Generation: End-to-end Fully Self-supervised Generation Of Human Face From Speech
2020 · Hyeong-Seok Choi, Changdae Park, Kyogu Lee
Abstract
This work explores whether a human face can be generated from voice using only audio-visual data, without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links an inference stage and a generation stage. First, the inference networks are trained to match speaker identity across the two modalities. The trained inference networks then cooperate with the generation network by supplying conditional information about the voice. The proposed method builds on recent developments in GAN techniques and generates the human face directly from the speech waveform, making our system fully end-to-end. We analyze the extent to which the network can naturally disentangle two latent factors that contribute to the generation of a face image, one that comes directly from the speech signal and one that is unrelated to it, and explore whether the network can learn to generate the distribution of natural human face images by modeling these
Related papers
- Reconstructing Faces From Voices (2019)
- Facetron: A Multi-speaker Face-to-speech Model Based On Cross-modal Latent Representations (2021)
- Audio Input Generates Continuous Frames To Synthesize Facial Video Using Generative Adversarial Networks (2022)
- Audio2face: Generating Speech/face Animation From Single Audio With Attention-based Bidirectional LSTM Networks (2019)
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019)
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025)
- End-to-end Video-to-speech Synthesis Using Generative Adversarial Networks (2021)
- A Unified Compression Framework For Efficient Speech-driven Talking-face Generation (2023)