Building A Mixed-lingual Neural TTS System With Only Monolingual Data
2019 Β· Liumeng Xue, Wei Song, Guanghui Xu, et al.
Abstract
When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an Average Voice Model which is built from multi-speaker monolingual data, i.e. Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.
Authors
(none)
Tags
Stats
Related papers
- Towards Natural Bilingual And Code-switched Speech Synthesis Based On Mix Of Monolingual Recordings And Cross-lingual Voice Conversion (2020)0.00
- Combining Speakers Of Multiple Languages To Improve Quality Of Neural Voices (2021)5.24
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)8.60
- Cross-lingual Multi-speaker Text-to-speech Synthesis For Voice Cloning Without Using Parallel Corpus For Unseen Speakers (2019)0.00
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)8.09
- Learning To Speak From Text: Zero-shot Multilingual Text-to-speech With Unsupervised Text Pretraining (2023)8.82
- Extending Multilingual Speech Synthesis To 100+ Languages Without Transcribed Data (2024)7.16
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)10.35