DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
2024 · Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, et al.
Abstract
Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose DubWise, a novel multimodal Large Language Model (LLM)-based Text-to-Speech (TTS) method that controls the duration of synthesized speech so that it aligns with the speaker's lip movements in the reference video, even when the spoken text is different or in a different language. To accomplish this, we use cross-modal attention in a pre-trained GPT-based TTS model. We combine linguistic tokens from text, speaker identity tokens from a voice cloning network, and video tokens from a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. The proposed method achieves improved lip sync and naturalness compared to state-of-the-art methods in both the same-language, different-text (i.e., non-parallel) and the different-language, different-text (i.e., cross-lingual) scenarios.
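The abstract names cross-modal attention as the mechanism that lets text tokens take video information into account. The following is a minimal sketch of generic scaled dot-product cross-attention, not the authors' implementation: the toy embeddings, the 2-dimensional token size, and the function names are all assumptions made for illustration, and the learned projections a real model would use are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: every query row attends
    # over all key rows and returns a weighted mix of the value rows.
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Hypothetical toy embeddings (assumed values, not from the paper).
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 text tokens act as queries
video_tokens = [[0.5, 0.5], [1.0, -1.0]]            # 2 video tokens act as keys and values

fused = cross_attention(text_tokens, video_tokens, video_tokens)
print(len(fused), len(fused[0]))
```

Each output row is a convex combination of the video token vectors, so the sequence keeps the length of the text input while carrying video-derived information.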
Related papers
- Large-scale Multilingual Audio Visual Dubbing (2020) 0.00
- VideoDubber: Machine Translation With Speech-aware Length Control For Video Dubbing (2022) 8.82
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025) 0.00
- MCDubber: Multimodal Context-aware Expressive Video Dubbing (2024) 5.91
- Learning To Dub Movies Via Hierarchical Prosody Models (2022) 10.97
- Neural Dubber: Dubbing For Videos According To Scripts (2021) 0.00
- Towards Expressive Video Dubbing With Multiscale Multimodal Context Interaction (2024) 4.52
- Joint Multi-scale Cross-lingual Speaking Style Transfer With Bidirectional Attention Mechanism For Automatic Dubbing (2023) 5.24