STFT Spectral Loss For Training A Neural Speech Waveform Model
2018 · Shinji Takaki, Toru Nakashika, Xin Wang, et al.
Abstract
This paper proposes a new loss based on short-time Fourier transform (STFT) spectra for training a high-performance neural speech waveform model that directly predicts raw continuous speech waveform samples. The proposed loss is calculated from both the amplitude spectra and the phase spectra of the generated speech waveforms. We also show mathematically that training the waveform model with the proposed loss can be interpreted as maximum likelihood training under the assumption that the amplitude and phase spectra of generated speech waveforms follow Gaussian and von Mises distributions, respectively. Furthermore, this paper presents a simple network architecture for the speech waveform model, composed of uni-directional long short-term memory (LSTM) layers and an auto-regressive structure. Experimental results showed that the proposed neural model synthesized high-quality speech waveforms.
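As a rough illustration of the idea in the abstract, the following NumPy sketch computes an amplitude term and a phase term from the STFTs of a target and a generated waveform. It is not the authors' implementation: the abstract gives no formulas, so the function names (`stft`, `stft_spectral_loss`), the Hann window, the frame and hop sizes, the `phase_weight` parameter, and the averaging over bins are all assumptions made here. The amplitude term is a squared error, which matches a Gaussian negative log-likelihood with unit variance up to a constant, and the phase term is `1 - cos(phase difference)`, which matches a von Mises negative log-likelihood with unit concentration up to a constant.

```python
import numpy as np


def stft(x, frame_len=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)


def stft_spectral_loss(target, generated, frame_len=512, hop=128, phase_weight=1.0):
    """Return (total, amplitude_term, phase_term) for two equal-length waveforms.

    Assumed forms (not taken from the paper):
      amplitude_term = 0.5 * mean((|Y_hat| - |Y|)^2)      # Gaussian NLL, unit variance
      phase_term     = mean(1 - cos(angle(Y_hat) - angle(Y)))  # von Mises NLL, kappa = 1
    """
    Y = stft(target, frame_len, hop)
    Y_hat = stft(generated, frame_len, hop)

    # Amplitude spectra: squared error between magnitudes.
    amp_term = 0.5 * np.mean((np.abs(Y_hat) - np.abs(Y)) ** 2)

    # Phase spectra: the cosine makes the term invariant to 2*pi wrapping.
    phase_diff = np.angle(Y_hat) - np.angle(Y)
    phase_term = np.mean(1.0 - np.cos(phase_diff))

    return amp_term + phase_weight * phase_term, amp_term, phase_term


# Usage: compare a noisy copy of a 220 Hz sine against the clean signal.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.05 * np.random.default_rng(0).standard_normal(len(clean))
total, amp_term, phase_term = stft_spectral_loss(clean, noisy)
print(total, amp_term, phase_term)
```

One property worth noting in this sketch: a sign-flipped waveform has the same amplitude spectrum as the original, so only the phase term can distinguish the two. That is a simple way to see why an amplitude-only loss underconstrains a model that predicts waveform samples directly.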
Related papers
- Wasserstein GAN And Waveform Loss-based Acoustic Model Training For Multi-speaker Text-to-speech Synthesis Systems Using A WaveNet Vocoder (2018)
- Low-latency Neural Speech Phase Prediction Based On Parallel Estimation Architecture And Anti-wrapping Losses For Speech Generation Tasks (2024)
- Learning Waveform-based Acoustic Models Using Deep Variational Convolutional Neural Networks (2019)
- Neural Speech Phase Prediction Based On Parallel Estimation Architecture And Anti-wrapping Losses (2022)
- Weighted Speech Distortion Losses For Neural-network-based Real-time Speech Enhancement (2020)
- High-fidelity And Low-latency Universal Neural Vocoder Based On Multiband WaveRNN With Data-driven Linear Prediction For Discrete Waveform Modeling (2021)
- Phase Continuity: Learning Derivatives Of Phase Spectrum For Speech Enhancement (2022)
- A Consolidated View Of Loss Functions For Supervised Deep Learning-based Speech Enhancement (2020)