Efficient Neural Audio Synthesis
2018 · Nal Kalchbrenner, Erich Elsen, Karen Simonyan, et al.
Abstract
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has, however, remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks, and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse Wave
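To make the "dual softmax layer" and "16-bit audio" in the abstract concrete: each 16-bit sample can be split into a coarse part (the high 8 bits) and a fine part (the low 8 bits), so that two 256-way softmaxes stand in for a single 65,536-way one. The sketch below shows only that bit-level split and its inverse. The function names `split_sample` and `join_sample` are hypothetical, and the samples are assumed to be already mapped to unsigned integers in [0, 65535]. No network is involved.

```python
from typing import Tuple


def split_sample(sample: int) -> Tuple[int, int]:
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts.

    Illustrative helper (name is hypothetical): coarse holds the 8 most
    significant bits, fine holds the 8 least significant bits.
    """
    if not 0 <= sample <= 0xFFFF:
        raise ValueError("sample must be in [0, 65535]")
    coarse = sample >> 8    # high byte, value in [0, 255]
    fine = sample & 0xFF    # low byte, value in [0, 255]
    return coarse, fine


def join_sample(coarse: int, fine: int) -> int:
    """Recombine (coarse, fine) 8-bit parts into one 16-bit sample."""
    if not (0 <= coarse <= 0xFF and 0 <= fine <= 0xFF):
        raise ValueError("coarse and fine must each be in [0, 255]")
    return (coarse << 8) | fine


if __name__ == "__main__":
    c, f = split_sample(0xABCD)
    print(c, f, join_sample(c, f))
```

Each part then takes one of only 256 values, which keeps each output layer small. How the fine part is conditioned on the coarse part is a modelling choice that this sketch does not cover.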
Related papers
- LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (2018) · 0.00
- Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017) · 0.00
- Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis (2020) · 0.00
- Neural Source-Filter-Based Waveform Model for Statistical Parametric Speech Synthesis (2018) · 13.97
- WaveNet: A Generative Model for Raw Audio (2016) · 0.00
- Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram (2019) · 0.00
- Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder (2018) · 12.61
- GANSynth: Adversarial Neural Audio Synthesis (2019) · 0.00