End-to-end Video-to-speech Synthesis Using Generative Adversarial Networks
2021 · Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, et al.
Abstract
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated au
Related papers
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019) 11.39
- High Fidelity Speech Synthesis With Adversarial Networks (2019) 0.00
- Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks (2018) 8.35
- Analysis By Adversarial Synthesis -- A Novel Approach For Speech Vocoding (2019) 3.58
- Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks (2017) 16.21
- End-to-end Adversarial Text-to-speech (2020) 0.00
- Expediting TTS Synthesis With Adversarial Vocoding (2019) 6.77
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019) 10.35