High-quality Automatic Voice Over With Accurate Alignment: Supervision Through Self-supervised Discrete Speech Units
2023 · Junchen Lu, Berrak Sisman, Mingyang Zhang, et al.
Abstract
The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance and synthetic speech quality. To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality by outperforming baselines in both objective and subjective evaluations. Code and speech samples are publicly available.
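The contrast between the two learning objectives named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no loss formulas, so the frame-level cross-entropy over unit indices, the mean-absolute-error form of the reconstruction loss, and the names `unit_prediction_loss`, `reconstruction_loss`, `logits`, and `unit_ids` are all assumptions made for illustration.

```python
import numpy as np


def unit_prediction_loss(logits, unit_ids):
    """Mean cross-entropy between per-frame logits and discrete unit targets.

    Assumed (hypothetical) shapes:
      logits:   (T, K) float array, one score per unit class per frame
      unit_ids: (T,) int array of target unit indices in [0, K)
    """
    # Numerically stable log-softmax over the unit vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability assigned to the target unit at each frame.
    frame_idx = np.arange(len(unit_ids))
    return float(-log_probs[frame_idx, unit_ids].mean())


def reconstruction_loss(pred_features, target_features):
    """Mean absolute error between predicted and target acoustic features.

    Stands in for an acoustic feature reconstruction objective; the exact
    form used by prior AVO systems is not stated in the abstract.
    """
    return float(np.abs(pred_features - target_features).mean())
```

In this sketch the unit objective is a classification problem with one discrete target per frame, whereas the reconstruction objective regresses continuous feature values.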
Related papers
- VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over (2021)
- Improving Lip-synchrony In Direct Audio-visual Speech-to-speech Translation (2024)
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020)
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023)
- Unpaired Speech Enhancement By Acoustic And Adversarial Supervision For Speech Recognition (2018)
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023)