DPI-TTS: Directional Patch Interaction For Fast-converging And Style Temporal Modeling In Text-to-speech
2024 Β· Xin Qi, Ruibo Fu, Zhengqi Wen, et al.
Abstract
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
Authors
(none)
Tags
Stats
Related papers
- DEX-TTS: Diffusion-based Expressive Text-to-speech With Style Modeling On Time Variability (2024)0.00
- Ditto-tts: Diffusion Transformers For Scalable Text-to-speech Without Domain-specific Factors (2024)0.00
- DCTTS: Discrete Diffusion Model With Contrastive Learning For Text-to-speech Generation (2023)5.72
- Diffstyletts: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024)0.00
- Styletts-zs: Efficient High-quality Zero-shot Text-to-speech Synthesis With Distilled Time-varying Style Diffusion (2024)3.58
- Diffgan-tts: High-fidelity And Efficient Text-to-speech With Denoising Diffusion Gans (2022)0.00
- Fastdiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022)14.35
- Prodiff: Progressive Fast Diffusion Model For High-quality Text-to-speech (2022)0.00