YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone
2021 · Edresson Casanova, Julian Weber, Christopher Shulby, et al.
Abstract
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
Related papers
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023) · 10.35
- The Zero Resource Speech Challenge 2019: TTS Without T (2019) · 13.17
- Learning To Speak From Text: Zero-shot Multilingual Text-to-speech With Unsupervised Text Pretraining (2023) · 8.82
- Automatic Tuning Of Loss Trade-offs Without Hyper-parameter Search In End-to-end Zero-shot Speech Synthesis (2023) · 3.58
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) · 4.52
- Rosettaspeech: Zero-shot Speech-to-speech Translation Without Parallel Speech (2025) · 0.00
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022) · 9.59
- StarGAN-ZSVC: Towards Zero-shot Voice Conversion In Low-resource Contexts (2021) · 3.58