ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations
2023 · Cheng Gong, Xin Wang, Erica Cooper, et al.
Abstract
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages because large corpora of paired text and studio-quality audio are scarce. TTS systems are typically built using a single speaker's voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms
Related papers
- Learning To Speak From Text: Zero-shot Multilingual Text-to-speech With Unsupervised Text Pretraining (2023) 8.82
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022) 9.59
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021) 0.00
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters (2024) 7.50
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020) 8.60
- HAM-TTS: Hierarchical Acoustic Modeling For Token-based Zero-shot Text-to-speech With Model And Data Scaling (2024) 0.00
- An Initial Investigation Of Language Adaptation For TTS Systems Under Low-resource Scenarios (2024) 3.58
- Adversarial Speaker-consistency Learning Using Untranscribed Speech Data For Zero-shot Multi-speaker Text-to-speech (2022) 4.52