VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over
2021 · Junchen Lu, Berrak Sisman, Rui Liu, et al.
Abstract
In this paper, we formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO). Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video. We propose a novel text-to-speech model conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization. The proposed VisualTTS adopts two novel mechanisms: 1) textual-visual attention, and 2) a visual fusion strategy during acoustic decoding. Both contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and outperforms all baseline systems.
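The abstract names textual-visual attention but does not specify it. As a rough illustration only, the sketch below shows a generic single-head scaled dot-product cross-attention in which each text embedding attends over the lip-frame embeddings. Treating text as the query and lip features as keys and values, the absence of learned projections, and the names `textual_visual_attention`, `text_emb`, and `lip_emb` are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def textual_visual_attention(text_emb, lip_emb):
    """Hypothetical cross-attention: text tokens (queries) attend to lip frames.

    text_emb: list of T_text vectors of size d (assumed text encoder output)
    lip_emb:  list of T_lip vectors of size d (assumed lip encoder output)
    Returns (contexts, weights): one context vector and one weight row per token.
    """
    d = len(text_emb[0])
    scale = math.sqrt(d)
    contexts, weights = [], []
    for query in text_emb:
        # Scaled dot-product score between this token and every lip frame.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in lip_emb]
        row = softmax(scores)
        # Context is the attention-weighted sum of lip-frame vectors.
        context = [sum(w * frame[i] for w, frame in zip(row, lip_emb)) for i in range(d)]
        contexts.append(context)
        weights.append(row)
    return contexts, weights


# Toy inputs: 2 text tokens, 3 lip frames, embedding size 2.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
lip_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
contexts, weights = textual_visual_attention(text_emb, lip_emb)
```

Each row of `weights` is a distribution over lip frames for one text token, which is one way a model could relate text positions to the timing of lip motion. A real system would likely add learned projections and multiple heads.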
Related papers
- High-quality Automatic Voice Over With Accurate Alignment: Supervision Through Self-supervised Discrete Speech Units (2023)
- Improving Lip-synchrony In Direct Audio-visual Speech-to-speech Translation (2024)
- More Than Words: In-the-wild Visually-driven Prosody For Text-to-speech (2021)
- Text-to-audio Generation Synchronized With Videos (2024)
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Audiovisual Speech Synthesis Using Tacotron2 (2020)