Neural Dubber: Dubbing For Videos According To Scripts
2021 · Chenxu Hu, Qiao Tian, Tingle Li, et al.
Abstract
Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with stat
Related papers
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025) 0.00
- Learning To Dub Movies Via Hierarchical Prosody Models (2022) 10.97
- MCDubber: Multimodal Context-aware Expressive Video Dubbing (2024) 5.91
- EmoDubber: Towards High Quality And Emotion Controllable Movie Dubbing (2024) 4.52
- Prosody-enhanced Acoustic Pre-training And Acoustic-disentangled Prosody Adapting For Movie Dubbing (2025) 3.58
- Large-scale Multilingual Audio Visual Dubbing (2020) 0.00
- Dubbing In Practice: A Large Scale Study Of Human Localization With Insights For Automatic Dubbing (2022) 8.82
- DubWise: Video-guided Speech Duration Control In Multimodal LLM-based Text-to-speech For Dubbing (2024) 3.58