MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
2024 · Ho Kei Cheng, Masato Ishii, Akio Hayakawa, et al.
Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
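The abstract names a flow matching objective without spelling it out. As a rough illustration only, here is a minimal sketch of the generic straight-path (linear interpolation) form of conditional flow matching. It is not MMAudio's implementation. The function names, the toy latent, and the `condition` argument are all hypothetical, and the paper's exact formulation may differ.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t, condition=None):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity is a hypothetical model callable taking
    (x_t, t, condition) and returning a list the same length as x_t.
    """
    x_t = interpolate(x0, x1, t)
    # On a straight path the target velocity is constant: x1 - x0.
    target = [b - a for a, b in zip(x0, x1)]
    predicted = predict_velocity(x_t, t, condition)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


# Toy usage: a 3-element stand-in for an audio latent (assumed values).
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = rng.random()                        # time drawn uniformly from [0, 1)


def zero_model(x_t, t, condition):
    """Placeholder model that always predicts zero velocity."""
    return [0.0] * len(x_t)


loss = flow_matching_loss(zero_model, x0, x1, t)
```

In a real system the placeholder model would be a neural network conditioned on video and text features, and generation would integrate the learned velocity field from noise toward an audio latent.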
Related papers
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025) · 0.00
- SyncFlow: Toward Temporally Aligned Joint Audio-video Generation From Text (2024) · 0.00
- HunyuanVideo-Foley: Multimodal Diffusion With Representation Alignment For High-fidelity Foley Audio Generation (2025) · 0.00
- Apollo: Unified Multi-task Audio-video Joint Generation (2026) · 0.00
- Audio-sync Video Generation With Multi-stream Temporal Control (2025) · 0.00
- JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation (2026) · 0.00
- Fake It To Make It: Using Synthetic Data To Remedy The Data Shortage In Joint Multimodal Speech-and-gesture Synthesis (2024) · 6.34
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024) · 0.00