A Simple But Strong Baseline For Sounding Video Generation: Effective Adaptation Of Audio And Video Diffusion Models For Joint Generation
2024 · Masato Ishii, Akio Hayakawa, Takashi Shibuya, et al.
Abstract
In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first is timestep adjustment, which provides different timestep information to each base model. It is designed to align, across modalities, how samples are generated over the timesteps. The second is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represented temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results va
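The CMC-PE idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`resample_time`, `cmc_pe`), the linear projection `proj`, the linear interpolation used to match frame rates, and all tensor shapes are assumptions made for illustration. The sketch only shows the general pattern of adding a per-frame cross-modal embedding to the other modality's features in the same way a temporal positional encoding would be added.

```python
import numpy as np

def resample_time(x, target_len):
    """Linearly interpolate features x of shape (T, D) to (target_len, D).

    Hypothetical helper: assumes both sequences cover the same time span.
    """
    src_pos = np.linspace(0.0, 1.0, x.shape[0])
    dst_pos = np.linspace(0.0, 1.0, target_len)
    cols = [np.interp(dst_pos, src_pos, x[:, d]) for d in range(x.shape[1])]
    return np.stack(cols, axis=1)

def cmc_pe(video_feats, audio_feats, proj):
    """Add audio information to video features like a temporal positional encoding.

    video_feats: (T_v, N, D_v)  frames x spatial tokens x channels (assumed layout)
    audio_feats: (T_a, D_a)     audio frames x channels (assumed layout)
    proj:        (D_a, D_v)     stand-in for a learned linear embedding
    """
    t_v = video_feats.shape[0]
    aligned = resample_time(audio_feats, t_v)  # (T_v, D_a), one vector per video frame
    embed = aligned @ proj                     # (T_v, D_v)
    # Every spatial token in a frame receives the same offset,
    # exactly as it would with a per-frame positional encoding.
    return video_feats + embed[:, None, :]

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 6, 8))   # 4 frames, 6 tokens per frame, 8 channels
audio = rng.normal(size=(16, 5))     # 16 audio frames, 5 channels
proj = rng.normal(size=(5, 8))
out = cmc_pe(video, audio, proj)
print(out.shape)
```

Because the conditioning signal is tied to a frame index rather than discovered through attention weights, the temporal correspondence between the modalities is fixed by construction in this sketch, which is one way to read the inductive bias the abstract attributes to CMC-PE.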
Related papers
- MMDisCo: Multi-modal Discriminator-guided Cooperative Diffusion For Joint Audio And Video Generation (2024) 1.91
- HunyuanVideo-Foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025) 0.00
- JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation (2026) 0.00
- AV-Link: Temporally-aligned Diffusion Features For Cross-modal Audio-video Generation (2024) 0.00
- Diff-Foley: Synchronized Video-to-audio Synthesis With Latent Diffusion Models (2023) 0.00
- Diverse And Aligned Audio-to-video Generation Via Text-to-video Model Adaptation (2023) 11.19
- 3MDiT: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) 0.00
- AADiff: Audio-aligned Video Synthesis With Text-to-image Diffusion (2023) 0.00