Towards Expressive Video Dubbing With Multiscale Multimodal Context Interaction
2024 · Yuan Zhao, Rui Liu, Gaoxiang Cong
Abstract
Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the cu
Related papers
- MCDubber: Multimodal Context-Aware Expressive Video Dubbing (2024) 5.91
- Learning to Dub Movies via Hierarchical Prosody Models (2022) 10.97
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models (2025) 0.00
- Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing (2025) 3.58
- Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing (2023) 5.24
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction (2025) 0.00
- Neural Dubber: Dubbing for Videos According to Scripts (2021) 0.00
- DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing (2024) 3.58