Prosody-enhanced Acoustic Pre-training And Acoustic-disentangled Prosody Adapting For Movie Dubbing
2025 · Zhedong Zhang, Liang Li, Chenggang Yan, et al.
Abstract
Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose a prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporat
Related papers
- Learning To Dub Movies Via Hierarchical Prosody Models (2022)
- MCDubber: Multimodal Context-aware Expressive Video Dubbing (2024)
- EmoDubber: Towards High Quality And Emotion Controllable Movie Dubbing (2024)
- Prosodic Phrase Alignment For Machine Dubbing (2019)
- Dubbing In Practice: A Large Scale Study Of Human Localization With Insights For Automatic Dubbing (2022)
- IQDUBBING: Prosody Modeling Based On Discrete Self-supervised Speech Representation For Expressive Voice Conversion (2022)
- Neural Dubber: Dubbing For Videos According To Scripts (2021)
- Towards Expressive Video Dubbing With Multiscale Multimodal Context Interaction (2024)