DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
2025 · Haomin Zhang, Chang Liu, Junjie Zheng, et al.
Abstract
Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks that leverage video and optional text inputs. On video-to-audio (V2A) benchmarks, audio quality, semantic alignment, and audio-visual synchronization are effectively achieved. However, in real-world scenarios, speech and audio often coexist in videos, and the end-to-end generation of synchronous speech and audio given video and text conditions is not well studied. Therefore, we propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. Furthermore, the advantages of V2A models for generating speech from videos remain unclear. The proposed framework, DeepAudio, consists of a V2A module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-ar
Related papers
- DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos (2025)
- MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (2024)
- HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation (2025)
- Semantically Consistent Video-to-Audio Generation Using Multimodal Language Large Model (2024)
- 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation (2025)
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation (2026)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation (2023)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction (2025)