VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
2024 · Jaemin Jung, Junseok Ahn, Chaeyoung Jung, et al.
Abstract
We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models o
Related papers
- 3MDiT: Unified Tri-modal Diffusion Transformer for Text-driven Synchronized Audio-video Generation (2025)
- ViT-TTS: Visual Text-to-speech with Scalable Diffusion Transformer (2023)
- DiffSpeaker: Speech-driven 3D Facial Animation with Diffusion Transformer (2024)
- DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer (2025)
- DiTTo-TTS: Diffusion Transformers for Scalable Text-to-speech Without Domain-specific Factors (2024)
- Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement (2024)
- ControlAudio: Tackling Text-guided, Timing-indicated and Intelligible Audio Generation via Progressive Diffusion Modeling (2025)
- EzAudio: Enhancing Text-to-audio Generation with Efficient Diffusion Transformer (2024)