FunCineForge: A Unified Dataset Toolkit And Model For Zero-shot Movie Dubbing In Diverse Cinematic Scenes
2026 · Jiaxuan Liu, Yang Xiang, Han Zhao, et al.
Abstract
Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct t
Related papers
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025)
- Prosody-enhanced Acoustic Pre-training And Acoustic-disentangled Prosody Adapting For Movie Dubbing (2025)
- Learning To Dub Movies Via Hierarchical Prosody Models (2022)
- Dubbing In Practice: A Large Scale Study Of Human Localization With Insights For Automatic Dubbing (2022)
- ANIM-400K: A Large-scale Dataset For Automated End-to-end Dubbing Of Video (2024)
- Large-scale Multilingual Audio Visual Dubbing (2020)
- EmoDubber: Towards High Quality And Emotion Controllable Movie Dubbing (2024)
- MCDubber: Multimodal Context-aware Expressive Video Dubbing (2024)