Javisdit: Joint Audio-video Diffusion Transformer With Hierarchical Spatio-temporal Prior Synchronization
2025 Β· Kai Liu, Wei Li, Lai Chen, et al.
Abstract
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Based on the powerful Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts in a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios. Further, we specifically devise a robust metric for measuring the synchrony between generated audio-video pairs in real-world content.
Authors
(none)
Tags
Stats
Related papers
- Javisdit++: Unified Modeling And Optimization For Joint Audio-video Generation (2026)0.00
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025)0.00
- Syncflow: Toward Temporally Aligned Joint Audio-video Generation From Text (2024)0.00
- Diff-foley: Synchronized Video-to-audio Synthesis With Latent Diffusion Models (2023)0.00
- Aadiff: Audio-aligned Video Synthesis With Text-to-image Diffusion (2023)0.00
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)0.00
- A Simple But Strong Baseline For Sounding Video Generation: Effective Adaptation Of Audio And Video Diffusion Models For Joint Generation (2024)3.58
- Voicedit: Dual-condition Diffusion Transformer For Environment-aware Speech Synthesis (2024)5.84