Pretraining Strategies, Waveform Model Choice, And Acoustic Configurations For Multi-speaker End-to-end Speech Synthesis
2020 · Erica Cooper, Xin Wang, Yi Zhao, et al.
Abstract
We explore pretraining strategies, including choice of base corpus, with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers. Additionally, we find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of comparable quality to WaveNet, with a faster inference time.
Related papers
- Wasserstein GAN And Waveform Loss-based Acoustic Model Training For Multi-speaker Text-to-speech Synthesis Systems Using A WaveNet Vocoder (2018)
- A Comparison Of Recent Waveform Generation And Acoustic Modeling Methods For Neural-network-based Speech Synthesis (2018)
- Efficient Neural Audio Synthesis (2018)
- RawNet: Advanced End-to-end Deep Neural Network Using Raw Waveforms For Text-independent Speaker Verification (2019)
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differential Digital Signal Processing (2025)
- Speaker Conditional WaveRNN: Towards Universal Neural Vocoder For Unseen Speaker And Recording Conditions (2020)
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018)
- Speaker-adaptive Neural Vocoders For Parametric Speech Synthesis Systems (2018)