JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation
2026 · Kai Liu, Yanhao Zheng, Kai Wang, et al.
Abstract
AI-generated content (AIGC) has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still fall short in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Second, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. In addition, we develop an audio-video direct pref
Related papers
- JavisDiT: Joint Audio-video Diffusion Transformer With Hierarchical Spatio-temporal Prior Synchronization (2025)
- DeepAudio-V1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- A Simple But Strong Baseline For Sounding Video Generation: Effective Adaptation Of Audio And Video Diffusion Models For Joint Generation (2024)
- Apollo: Unified Multi-task Audio-video Joint Generation (2026)
- 3MDiT: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025)
- MMAudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)
- DreamFoley: Scalable VLMs For High-fidelity Video-to-audio Generation (2025)
- OmniCustom: Sync Audio-video Customization Via Joint Audio-video Generation Model (2026)