DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors
2024 · Keon Lee, Dong Won Kim, Jaehyeon Kim, et al.
Abstract
Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot per
Related papers
- DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (2023)
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis (2025)
- ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer (2023)
- NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers (2023)
- DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech (2024)
- VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis (2024)
- DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (2022)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models (2023)