Multi-GradSpeech: Towards Diffusion-based Multi-speaker Text-to-speech Using Consistent Diffusion Models
2023 · Heyang Xue, Shuai Guo, Pengcheng Zhu, et al.
Abstract
Although imperfect score matching causes drift between the training and sampling distributions of diffusion models, recent advances in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example. However, the sampling drift problem causes these approaches to struggle in practice in multi-speaker scenarios, where the target data distribution is more complex than in single-speaker scenarios. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model that introduces the Consistent Diffusion Model (CDM) as a generative modeling approach. We enforce the consistency property of CDM during training to alleviate the sampling drift problem at the inference stage, resulting in significant improvements in multi-speaker TTS performance. Our experimental results corroborate that our proposed approach can improve the performance of different speakers involved in multi-speake
Related papers
- Grad-StyleSpeech: Any-speaker Adaptive Text-to-speech Synthesis With Diffusion Models (2022) 0.00
- CM-TTS: Enhancing Real Time Text-to-speech Synthesis Efficiency Through Weighted Samplers And Consistency Models (2024) 5.24
- ConsistencyTTA: Accelerating Diffusion-based Text-to-audio Generation With Consistency Distillation (2023) 6.77
- DCTTS: Discrete Diffusion Model With Contrastive Learning For Text-to-speech Generation (2023) 5.72
- Adversarial Training Of Denoising Diffusion Model Using Dual Discriminators For High-fidelity Multi-speaker TTS (2023) 2.26
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025) 0.00
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023) 5.24
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024) 0.00