Neural Source-filter-based Waveform Model For Statistical Parametric Speech Synthesis
2018 · Xin Wang, Shinji Takaki, Junichi Yamagishi
Abstract
Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and the blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be directly trained using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet, and that the quality of its synthetic speech is close to that of speech generated by the AR WaveNet.
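The source-then-filter pipeline and the spectrum-based criterion described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the convention that F0 = 0 marks unvoiced frames, the amplitude and noise values, the hop size, and the fixed FIR kernel standing in for the neural filter module are all assumptions made for illustration.

```python
import numpy as np


def sine_excitation(f0_frames, sample_rate=16000, hop=80,
                    amp=0.1, noise_std=0.003, seed=0):
    """Build a sine-based excitation from frame-level F0 values in Hz.

    Assumption: F0 == 0 marks an unvoiced frame, which gets noise only.
    """
    rng = np.random.default_rng(seed)
    # Hold each frame's F0 constant for `hop` samples.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    voiced = f0 > 0
    # Instantaneous phase is the running sum of 2*pi*f0/sample_rate.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    noise = rng.normal(0.0, 1.0, size=f0.shape)
    return np.where(voiced,
                    amp * np.sin(phase) + noise_std * noise,
                    (amp / 3.0) * noise)


def fir_filter(excitation, kernel):
    """Placeholder for the filter module: a fixed FIR filter.

    The paper's filter module is a neural network; this is only a stand-in.
    """
    return np.convolve(excitation, kernel, mode="same")


def log_spectral_distance(x, y, frame_len=512, hop=128, eps=1e-5):
    """Mean squared difference between framed log power spectra.

    Expects two signals of equal length, at least `frame_len` samples long.
    """
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)

    def log_power(sig):
        frames = np.stack([sig[s:s + frame_len] * window for s in starts])
        return np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)

    return float(np.mean((log_power(x) - log_power(y)) ** 2))


if __name__ == "__main__":
    f0 = [0.0] * 10 + [200.0] * 40          # 10 unvoiced, 40 voiced frames
    e = sine_excitation(f0)
    y = fir_filter(e, np.array([0.5, 0.5]))  # crude low-pass "filter module"
    print(e.shape, log_spectral_distance(e, y))
```

Because the whole excitation is computed in one vectorized pass rather than sample by sample, the sketch also hints at why a non-AR design can generate waveforms quickly. In a trainable version, a distance of this kind would be computed between generated and natural speech and minimized by gradient descent.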
Related papers
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019) 10.35
- ExcitNet Vocoder: A Neural Excitation Model For Parametric Speech Synthesis Systems (2018) 9.76
- A Comparison Of Recent Waveform Generation And Acoustic Modeling Methods For Neural-network-based Speech Synthesis (2018) 11.76
- Parallel WaveGAN: A Fast Waveform Generation Model Based On Generative Adversarial Networks With Multi-resolution Spectrogram (2019) 0.00
- Speaker-independent Raw Waveform Model For Glottal Excitation (2018) 9.76
- Multi-task WaveNet: A Multi-task Generative Model For Statistical Parametric Speech Synthesis Without Fundamental Frequency Conditions (2018) 8.09
- A Neural Parametric Singing Synthesizer (2017) 10.97
- Efficient Neural Audio Synthesis (2018) 0.00