DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation
2025 · Fu Li, Weichao Zhao, You Li, et al.
Abstract
Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines the immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-Foley, HunyuanVideo-Foley, and ThinkSound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual-visual encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training effi
Related papers
- HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation (2025)
- Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models (2023)
- Semantically Consistent Video-to-Audio Generation Using Multimodal Language Large Model (2024)
- SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text (2024)
- 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation (2025)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching (2024)
- Video-to-Audio Generation with Hidden Alignment (2024)
- DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation (2025)