Learning To Speak From Text: Zero-shot Multilingual Text-to-speech With Unsupervised Text Pretraining
2023 · Takaaki Saeki, Soumi Maiti, Xinjian Li, et al.
Abstract
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error
Related papers
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)
- Extending Multilingual Speech Synthesis To 100+ Languages Without Transcribed Data (2024)
- End-to-end Text-to-speech For Low-resource Languages By Cross-lingual Transfer Learning (2019)
- Transfer Learning Framework For Low-resource Text-to-speech Using A Large-scale Unlabeled Speech Corpus (2022)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (2020)
- The Zero Resource Speech Challenge 2019: TTS Without T (2019)
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021)
- Building A Mixed-lingual Neural TTS System With Only Monolingual Data (2019)