Grad-StyleSpeech: Any-speaker Adaptive Text-to-speech Synthesis With Diffusion Models
2022 · Minki Kang, Dongchan Min, Sung Ju Hwang
Abstract
There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.
Related papers
- Multi-GradSpeech: Towards Diffusion-based Multi-speaker Text-to-speech Using Consistent Diffusion Models (2023) · 0.00
- StyleTTS-ZS: Efficient High-quality Zero-shot Text-to-speech Synthesis With Distilled Time-varying Style Diffusion (2024) · 3.58
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024) · 0.00
- Style Description Based Text-to-speech With Conditional Prosodic Layer Normalization Based Diffusion GAN (2023) · 0.00
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- StyleTTS 2: Towards Human-level Text-to-speech Through Style Diffusion And Adversarial Training With Large Speech Language Models (2023) · 8.09
- Guided-TTS 2: A Diffusion Model For High-quality Adaptive Text-to-speech With Untranscribed Data (2022) · 0.00
- DEX-TTS: Diffusion-based Expressive Text-to-speech With Style Modeling On Time Variability (2024) · 0.00