VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
2025 · Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, et al.
Abstract
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchr
Related papers
- Neural Dubber: Dubbing for Videos According to Scripts (2021)
- Learning to Dub Movies via Hierarchical Prosody Models (2022)
- Large-Scale Multilingual Audio Visual Dubbing (2020)
- MCDubber: Multimodal Context-Aware Expressive Video Dubbing (2024)
- LA-VocE: Low-SNR Audio-Visual Speech Enhancement Using Neural Vocoders (2022)
- FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes (2026)
- Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction (2024)
- DubWise: Video-Guided Speech Duration Control in Multimodal LLM-Based Text-to-Speech for Dubbing (2024)