A Unified Speaker Adaptation Method For Speech Synthesis Using Transcribed And Untranscribed Speech With Backpropagation
2019 · Hieu-Thi Luong, Junichi Yamagishi
Abstract
By representing speaker characteristics as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to unseen speakers regardless of whether the transcript of adaptation data is available or not. However, this setup restricts the speaker component to just a single bias vector, which in turn limits the performance of the adaptation process. In this study, we propose a novel speech synthesis model, which can be adapted to unseen speakers by fine-tuning part or all of the network using either transcribed or untranscribed speech. Our methodology essentially consists of two steps: first, we split the conventional acoustic model into a speaker-independent (SI) linguistic encoder and a speaker-adaptive (SA) acoustic decoder; second, we train an auxiliary acoustic encoder that can be used as a substitute for the linguistic encoder whenever linguistic features ar
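The two-step design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: each component is reduced to a single linear map, and every name here (`ToySynthesizer`, `adapt_decoder`, the dimensions, the learning rate) is hypothetical. The sketch assumes that both encoders map into one shared latent space and that adaptation updates only the decoder, which is one reading of "fine-tuning part of the network".

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for a trained network layer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToySynthesizer:
    """Hypothetical linear stand-in for the three components in the abstract."""

    def __init__(self, ling_dim, acoustic_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        # Speaker-independent (SI) linguistic encoder: text features -> latent.
        self.linguistic_encoder = random_matrix(latent_dim, ling_dim, rng)
        # Auxiliary acoustic encoder: speech features -> the same latent space.
        self.acoustic_encoder = random_matrix(latent_dim, acoustic_dim, rng)
        # Speaker-adaptive (SA) acoustic decoder: latent -> speech features.
        self.acoustic_decoder = random_matrix(acoustic_dim, latent_dim, rng)

    def encode(self, acoustic, linguistic=None):
        # Use the linguistic encoder when a transcript is available,
        # otherwise substitute the acoustic encoder.
        if linguistic is not None:
            return matvec(self.linguistic_encoder, linguistic)
        return matvec(self.acoustic_encoder, acoustic)

    def reconstruct(self, acoustic, linguistic=None):
        return matvec(self.acoustic_decoder, self.encode(acoustic, linguistic))


def squared_error(model, frames):
    """Sum of squared errors over (linguistic_or_None, acoustic) frames."""
    total = 0.0
    for linguistic, acoustic in frames:
        predicted = model.reconstruct(acoustic, linguistic)
        total += sum((p - t) ** 2 for p, t in zip(predicted, acoustic))
    return total


def adapt_decoder(model, frames, learning_rate=0.5, epochs=50):
    """Fine-tune only the decoder by gradient descent; encoders stay frozen."""
    for _ in range(epochs):
        for linguistic, acoustic in frames:
            latent = model.encode(acoustic, linguistic)
            predicted = matvec(model.acoustic_decoder, latent)
            error = [p - t for p, t in zip(predicted, acoustic)]
            # Gradient of 0.5 * ||predicted - acoustic||^2 with respect to
            # decoder weight [i][j] is error[i] * latent[j].
            for i, e in enumerate(error):
                for j, z in enumerate(latent):
                    model.acoustic_decoder[i][j] -= learning_rate * e * z


model = ToySynthesizer(ling_dim=4, acoustic_dim=3, latent_dim=2)
untranscribed = [(None, [0.2, -0.1, 0.4])]  # speech only, no transcript
adapt_decoder(model, untranscribed)
```

The same `adapt_decoder` call accepts transcribed frames by passing a linguistic feature vector in place of `None`, which is the sense in which a single backpropagation procedure covers both cases in this sketch.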
Related papers
- Multimodal Speech Synthesis Architecture For Unsupervised Speaker Adaptation (2018)
- Linear Networks Based Speaker Adaptation For Speech Synthesis (2018)
- Speaker-adaptive Neural Vocoders For Parametric Speech Synthesis Systems (2018)
- Sample Efficient Adaptive Text-to-speech (2018)
- Bayesian Learning For Deep Neural Network Adaptation (2020)
- Scaling And Bias Codes For Modeling Speaker-adaptive DNN-based Speech Synthesis Systems (2018)
- Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings (2019)
- NNSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022)