Transfer Learning From Speaker Verification To Multispeaker Text-to-speech Synthesis
2018 · Ye Jia, Yu Zhang, Ron J. Weiss, et al.
Abstract
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize n
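The three-stage data flow described in the abstract can be sketched as follows. This is only a toy illustration of how the components connect, not the authors' implementation: every function body is a placeholder for a trained network, and the names `speaker_encoder`, `synthesizer`, `vocoder`, `clone_voice`, as well as the sizes `EMBED_DIM`, `N_MELS`, and `HOP`, are assumptions introduced here. Conditioning by concatenating the embedding to each text encoder state is likewise one plausible choice assumed for the sketch.

```python
import math
import random

EMBED_DIM = 256  # assumed embedding size; the abstract only says "fixed-dimensional"
N_MELS = 80      # assumed number of mel channels
HOP = 4          # toy number of waveform samples generated per mel frame


def speaker_encoder(reference_audio):
    """Placeholder: map variable-length audio to a fixed-size, unit-norm vector."""
    # Deterministic stand-in for a network trained on speaker verification.
    rng = random.Random(len(reference_audio))
    vec = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def synthesizer(text, speaker_embedding):
    """Placeholder: text plus speaker embedding -> mel spectrogram frames."""
    frames = []
    for ch in text:
        # Condition on the speaker by concatenating the embedding
        # to each (toy, one-dimensional) text encoder state.
        state = [ord(ch) / 255.0] + speaker_embedding
        level = sum(state) / len(state)
        frames.append([level] * N_MELS)  # one mel frame per character
    return frames


def vocoder(mel):
    """Placeholder: expand each mel frame into HOP time-domain samples."""
    return [frame[0] for frame in mel for _ in range(HOP)]


def clone_voice(text, reference_audio):
    """Chain the three independently trained stages."""
    embedding = speaker_encoder(reference_audio)  # stage 1
    mel = synthesizer(text, embedding)            # stage 2
    waveform = vocoder(mel)                       # stage 3
    return embedding, mel, waveform


# One second of silence at an assumed 16 kHz stands in for the reference clip.
embedding, mel, waveform = clone_voice("hello", [0.0] * 16000)
```

The point of the sketch is the interface between stages: the speaker encoder never sees text, the synthesizer never sees the reference audio directly, and the vocoder depends only on the mel spectrogram, which is what allows the three components to be trained independently.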
Related papers
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017) · 0.00
- Voice Cloning: A Multi-speaker Text-to-speech Synthesis Approach Based On Transfer Learning (2021) · 0.00
- Transfer Learning From Speech Synthesis To Voice Conversion With Non-parallel Training Data (2020) · 12.74
- Speaker Verification-derived Loss And Data Augmentation For DNN-based Multispeaker Speech Synthesis (2021) · 3.58
- Neural Speech Synthesis With Transformer Network (2018) · 19.95
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019) · 15.03
- Learning Speaker Embedding From Text-to-speech (2020) · 5.84
- Voice Imitating Text-to-speech Neural Networks (2018) · 0.00