StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
2023 · Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, et al.
Abstract
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker
Related papers
- StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion (2024) · 3.58
- StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis (2022) · 10.97
- StyleSpeech: Parameter-Efficient Fine Tuning for Pre-Trained Controllable Text-to-Speech (2024) · 6.34
- StyleS2ST: Zero-Shot Style Transfer for Direct Speech-to-Speech Translation (2023) · 0.00
- Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation (2025) · 0.00
- DiffStyleTTS: Diffusion-Based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles (2024) · 0.00
- Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models (2022) · 0.00
- Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge (2023) · 5.24