Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
2025 · Kaisi Guan, Xihua Wang, Zhengfeng Lai, et al.
Abstract
This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while keeping both modalities aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single shared text caption, in which the text for video is identical to the text for audio, often creates modal interference that confuses the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust "bridge" to enabl
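The abstract is cut off before it describes how the DCA "bridge" works, so the following is only a minimal sketch of one plausible reading: two token streams (one per tower) that each attend to the other and add the result back as a residual. The bidirectional exchange, the single head, the absence of learned projections, and the names `cross_attend` and `dual_cross_attention` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Assumption: single head, no learned Q/K/V projections, so the
    context tokens serve as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def dual_cross_attention(video_tokens, audio_tokens):
    """Hypothetical two-way bridge: video attends to audio and audio to video.

    Both updates are computed from the *input* tokens and then added
    back as residuals, so neither direction sees the other's update.
    """
    video_update = cross_attend(video_tokens, audio_tokens)
    audio_update = cross_attend(audio_tokens, video_tokens)
    new_video = [
        [x + u for x, u in zip(tok, upd)]
        for tok, upd in zip(video_tokens, video_update)
    ]
    new_audio = [
        [x + u for x, u in zip(tok, upd)]
        for tok, upd in zip(audio_tokens, audio_update)
    ]
    return new_video, new_audio


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]  # two toy video tokens
    audio = [[0.5, 0.5]]              # one toy audio token
    print(dual_cross_attention(video, audio))
```

In this toy setup each stream keeps its own token count and feature size; only the content of the tokens is mixed. A real dual-tower model would apply learned projections and multiple heads inside each block, which this sketch deliberately leaves out.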
Related papers
- 3mdit: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025) · 0.00
- Vssflow: Unifying Video-conditioned Sound And Speech Generation Via Joint Learning (2025) · 0.00
- Text-to-audio Generation Synchronized With Videos (2024) · 0.00
- Mechanisms Of Multimodal Synchronization: Insights From Decoder-based Video-text-to-speech Synthesis (2024) · 0.00
- Deepaudio-v1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025) · 0.00
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024) · 0.00
- Watch, Listen, And Describe: Globally And Locally Aligned Cross-modal Attentions For Video Captioning (2018) · 12.87
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020) · 10.61