nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech
2022 · Botao Zhao, Xulong Zhang, Jianzong Wang, et al.
Abstract
Multi-speaker text-to-speech (TTS) using only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Rather than using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech similar to the target speaker with only one adaptation utterance.
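As a rough illustration of what it means for one distribution over Z to be approximated by another, the sketch below shows two generic conditional-VAE ingredients: a KL divergence between two diagonal Gaussians and a reparameterized sample of Z. It assumes both distributions are diagonal Gaussians parameterized by means and log-variances, which the abstract does not state; the function names are hypothetical and this is not the nnSpeech implementation.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log-variances.
    Assumption: both distributions over Z are diagonal Gaussians.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 3-dimensional statistics; in a real model these would be
    # predicted by networks (e.g. one conditioned on reference mel and phonemes).
    mu_a, logvar_a = [0.2, -0.1, 0.4], [0.0, -0.5, 0.3]
    mu_b, logvar_b = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    print("KL:", kl_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b))
    print("z:", sample_latent(mu_a, logvar_a, rng))
```

Minimizing such a KL term during training pulls the two distributions together, so that at synthesis time Z can be drawn from the one that only needs the reference utterance and the phoneme sequence.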
Related papers
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023) · 10.35
- Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings (2019) · 15.67
- Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis (2022) · 10.07
- Generalizable Zero-shot Speaker Adaptive Speech Synthesis With Disentangled Representations (2023) · 6.34
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters (2024) · 7.50
- Sample Efficient Adaptive Text-to-speech (2018) · 0.00
- Adversarial Speaker-consistency Learning Using Untranscribed Speech Data For Zero-shot Multi-speaker Text-to-speech (2022) · 4.52
- Enhancing Zero-shot Multi-speaker TTS With Negated Speaker Representations (2024) · 3.58