Audiovisual Speech Synthesis Using Tacotron2
2020 · Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, et al.
Abstract
Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The reconstructed acoustic speech signal is then used to drive the facial controls of the face model using an indepe
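The two systems described in the abstract differ mainly in where the facial controllers come from. The following is a minimal sketch of that data flow only, not the authors' implementation: every function name, type, and toy stand-in below is hypothetical, and the audio-driven facial component of the modular system is represented by a placeholder because its description is cut off in the abstract above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # placeholder type for one frame of features


@dataclass
class AVOutput:
    waveform: List[float]       # reconstructed speech samples
    face_controls: List[Frame]  # face-model controllers, one entry per frame


def end_to_end(
    phonemes: Sequence[str],
    av_model: Callable[[Sequence[str]], Tuple[List[Frame], List[Frame]]],
    vocoder: Callable[[List[Frame]], List[float]],
) -> AVOutput:
    """AVTacotron2-style flow: one model emits acoustic features and face controls."""
    acoustic, controls = av_model(phonemes)
    return AVOutput(waveform=vocoder(acoustic), face_controls=controls)


def modular(
    phonemes: Sequence[str],
    tts_model: Callable[[Sequence[str]], List[Frame]],
    vocoder: Callable[[List[Frame]], List[float]],
    audio_to_face: Callable[[List[float]], List[Frame]],
) -> AVOutput:
    """Modular flow: synthesize audio first, then drive the face from that audio."""
    acoustic = tts_model(phonemes)
    waveform = vocoder(acoustic)
    return AVOutput(waveform=waveform, face_controls=audio_to_face(waveform))


# Toy stand-ins (assumptions for illustration, not real models).
def toy_tts(phonemes: Sequence[str]) -> List[Frame]:
    return [[float(len(p))] for p in phonemes]


def toy_av_model(phonemes: Sequence[str]) -> Tuple[List[Frame], List[Frame]]:
    return toy_tts(phonemes), [[0.5] for _ in phonemes]


def toy_vocoder(acoustic: List[Frame]) -> List[float]:
    # Stands in for WaveRNN: flattens feature frames into "samples".
    return [value for frame in acoustic for value in frame]


def toy_audio_to_face(waveform: List[float]) -> List[Frame]:
    return [[abs(sample)] for sample in waveform]


e2e = end_to_end(["HH", "AH"], toy_av_model, toy_vocoder)
mod = modular(["HH", "AH"], toy_tts, toy_vocoder, toy_audio_to_face)
```

In the end-to-end flow the facial controllers are predicted jointly with the acoustic features from the phoneme sequence, while in the modular flow they depend only on the reconstructed audio.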
Related papers
- Tacotron: Towards End-to-end Speech Synthesis (2017)
- VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over (2021)
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022)
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020)
- Wave-Tacotron: Spectrogram-free End-to-end Text-to-speech Synthesis (2020)
- Taco-VC: A Single Speaker Tacotron Based Voice Conversion With Limited Data (2019)
- Text-driven Talking Face Synthesis By Reprogramming Audio-driven Models (2023)