MCDubber: Multimodal Context-aware Expressive Video Dubbing
2024 · Yuan Zhao, Zhenqi Jia, Rui Liu, et al.
Abstract
Automatic Video Dubbing (AVD) aims to generate speech from a given script that aligns with lip motion and is prosodically expressive. Current AVD models mainly use the visual information of the current sentence to enhance the prosody of the synthesized speech. However, because the dubbing is combined with the original context in the final video, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context. Previous studies have overlooked this aspect. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, which shifts the modeling object from a single sentence to a longer sequence with context information, so that prosody stays consistent with the global context. MCDubber comprises three main components: (1) a context duration aligner, which learns the context-aware alignment between the text and lip frames; (2) a context prosody predictor, which seeks to read the global context visual sequenc
Related papers
- Towards Expressive Video Dubbing With Multiscale Multimodal Context Interaction (2024)
- Neural Dubber: Dubbing For Videos According To Scripts (2021)
- Learning To Dub Movies Via Hierarchical Prosody Models (2022)
- Prosody-enhanced Acoustic Pre-training And Acoustic-disentangled Prosody Adapting For Movie Dubbing (2025)
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025)
- EmoDubber: Towards High Quality And Emotion Controllable Movie Dubbing (2024)
- DubWise: Video-guided Speech Duration Control In Multimodal LLM-based Text-to-speech For Dubbing (2024)
- Large-scale Multilingual Audio Visual Dubbing (2020)