EmoDubber: Towards High Quality And Emotion Controllable Movie Dubbing
2024 · Gaoxiang Cong, Jiadong Pan, Liang Li, et al.
Abstract
Given a piece of text, a video clip, and a reference audio sample, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to maintain audio-visual sync and achieve clear pronunciation at the same time; (2) they cannot express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that lets users specify emotion type and emotional intensity while preserving high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which uses duration-level contrastive learning to learn the inherent consistency between lip motion and prosody variation and thereby produce a reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker iden
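The abstract does not state the exact loss used by LPA. As a rough illustration of what contrastive alignment between two modalities can look like in general, the sketch below uses a standard InfoNCE objective, which may differ from the paper's formulation. The names `info_nce`, `lip`, `prosody`, and `temperature` are hypothetical and chosen only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def info_nce(lip, prosody, temperature=0.1):
    """Mean InfoNCE loss; lip[i] and prosody[i] are treated as a positive pair.

    Assumption: one embedding per segment for each modality. This is a
    generic contrastive loss, not the loss defined in the paper.
    """
    n = len(lip)
    total = 0.0
    for i in range(n):
        # Similarity of lip segment i to every prosody segment.
        logits = [cosine(lip[i], prosody[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching segment.
        total += log_denom - logits[i]
    return total / n

# Toy embeddings: matched pairs should give a lower loss than shuffled pairs.
lip = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(lip, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(lip, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup, minimizing the loss pulls each lip embedding toward its own prosody embedding and away from the others, which is the general idea behind learning cross-modal consistency contrastively.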
Related papers
- Learning To Dub Movies Via Hierarchical Prosody Models (2022) · 10.97
- Prosody-enhanced Acoustic Pre-training And Acoustic-disentangled Prosody Adapting For Movie Dubbing (2025) · 3.58
- Neural Dubber: Dubbing For Videos According To Scripts (2021) · 0.00
- MCDubber: Multimodal Context-aware Expressive Video Dubbing (2024) · 5.91
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025) · 0.00
- Dubbing In Practice: A Large Scale Study Of Human Localization With Insights For Automatic Dubbing (2022) · 8.82
- Prosodic Phrase Alignment For Machine Dubbing (2019) · 8.60
- VideoDubber: Machine Translation With Speech-aware Length Control For Video Dubbing (2022) · 8.82