Deepsound-v1: Start To Think Step-by-step In The Audio Generation From Videos
2025 · Yunming Liang, Zihao Chen, Chaofan Ding, et al.
Abstract
Currently, high-quality, synchronized audio is synthesized from video and optional text inputs using various multi-modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open-source video-audio and text-audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) to enable step-by-step reasoning without requiring additional annotations. Additionally, a corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio and achieving competitive performance compared to various state-of-the-art models. The ev
Related papers
- Deepaudio-v1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- Enhance Generation Quality Of Flow Matching V2A Model Via Multi-step Cot-like Guidance And Combined Preference Optimization (2025)
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024)
- Video-to-audio Generation With Hidden Alignment (2024)
- Dreamfoley: Scalable Vlms For High-fidelity Video-to-audio Generation (2025)
- Mmaudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)
- Hunyuanvideo-foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025)
- Audio-agent: Leveraging Llms For Audio Generation, Editing And Composition (2024)