Pretraining Strategies, Waveform Model Choice, And Acoustic Configurations For Multi-speaker End-to-end Speech Synthesis

Abstract

We explore pretraining strategies, including the choice of base corpus, with the aim of identifying the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine the choice of neural vocoder for waveform synthesis, as well as the acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers. Additionally, we find that listeners can distinguish between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of comparable quality to WaveNet, with faster inference.
