NaturalSpeech 2: Latent Diffusion Models Are Natural And Zero-shot Speech And Singing Synthesizers
2023 · Kai Shen, Zeqian Ju, Xu Tan, et al.
Abstract
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, an approach that suffers from unstable prosody, word skipping/repeating issues, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important for diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate
Related papers
- NaturalSpeech 3: Zero-shot Speech Synthesis With Factorized Codec And Diffusion Models (2024)
- SimpleSpeech: Towards Simple And Efficient Text-to-speech With Scalar Latent Transformer Diffusion Models (2024)
- DiTTo-TTS: Diffusion Transformers For Scalable Text-to-speech Without Domain-specific Factors (2024)
- StyleTTS-ZS: Efficient High-quality Zero-shot Text-to-speech Synthesis With Distilled Time-varying Style Diffusion (2024)
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023)
- Mega-TTS: Zero-shot Text-to-speech At Scale With Intrinsic Inductive Bias (2023)
- HiddenSinger: High-quality Singing Voice Synthesis Via Neural Audio Codec And Latent Diffusion Models (2023)
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer For Zero-shot Speech Synthesis (2025)