VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
2025 · Xin Cheng, Yuyue Wang, Xihua Wang, et al.
Abstract
Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as a set of distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that solves both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism that leverages the distinct intrinsic properties of attention layers: cross-attention for semantic conditions and self-attention for temporally intensive conditions. Moreover, contrary to the prevailing belief that joint training on the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance throughout the end-to-end joint learning process. Furthermore, we use a straightforward feature-level data synthesis method, demonstrating that our framework provides
Related papers
- V2SFlow: Video-to-speech Generation With Speech Decomposition And Rectified Flow (2024)
- SyncFlow: Toward Temporally Aligned Joint Audio-video Generation From Text (2024)
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025)
- 3MDiT: Unified Tri-modal Diffusion Transformer For Text-driven Synchronized Audio-video Generation (2025)
- Flow-TSVAD: Target-speaker Voice Activity Detection Via Latent Flow Matching (2024)
- FlowAVSE: Efficient Audio-visual Speech Enhancement With Conditional Flow Matching (2024)
- DreamFoley: Scalable VLMs For High-fidelity Video-to-audio Generation (2025)
- VoiceFlow: Efficient Text-to-speech With Rectified Flow Matching (2023)