Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
2023 · Simian Luo, Chuanhao Yan, Chenxu Hu, et al.
Abstract
Video-to-Audio (V2A) models have recently gained attention for their practical application in generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods have limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method with a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features in a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via a cross-attention module. We further significantly improve sample quality with 'double guidance'. Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset. Furthermore, we demonst
Related papers
- Dreamfoley: Scalable VLMs For High-fidelity Video-to-audio Generation (2025)
- Hunyuanvideo-foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025)
- Aadiff: Audio-aligned Video Synthesis With Text-to-image Diffusion (2023)
- Av-link: Temporally-aligned Diffusion Features For Cross-modal Audio-video Generation (2024)
- Diffusion Models As Masked Audio-video Learners (2023)
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model (2023)
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025)
- A Simple But Strong Baseline For Sounding Video Generation: Effective Adaptation Of Audio And Video Diffusion Models For Joint Generation (2024)