Enhance Generation Quality Of Flow Matching V2A Model Via Multi-step CoT-like Guidance And Combined Preference Optimization
2025 · Haomin Zhang, Sizhe Shan, Haoyu Wang, et al.
Abstract
Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a CoP multi-modal dataset, guided by video, to support step-by-step sound effects generation. Evaluat
Related papers
- DeepSound-V1: Start To Think Step-by-step In The Audio Generation From Videos (2025)
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- AudioMoG: Guiding Audio Generation With Mixture-of-guidance (2025)
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation (2026)
- HunyuanVideo-Foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025)
- A Simple But Strong Baseline For Sounding Video Generation: Effective Adaptation Of Audio And Video Diffusion Models For Joint Generation (2024)
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024)