Neural Speech Synthesis With Transformer Network
2018 · Naihan Li, Shujie Liu, Yanqing Liu, et al.
Abstract
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by the self-attention mechanism, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio…
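The two properties the abstract attributes to multi-head self-attention (each position's hidden state is computed independently of the others' outputs, and any two time steps are connected directly) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses only the Python standard library, takes Q = K = V = x with no learned projection matrices, and forms heads by slicing the feature dimension. The function names `softmax`, `self_attention`, and `multi_head_self_attention` are hypothetical and chosen for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Scaled dot-product self-attention with Q = K = V = x.
    # Assumption: no learned projections, to keep the sketch short.
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Each query position depends only on the inputs, never on another
        # position's output, so this loop could run in parallel.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        )
    return outputs, weights

def multi_head_self_attention(x, num_heads):
    # Split the feature dimension into equal slices, attend within each
    # slice, then concatenate the per-head outputs for every position.
    d = len(x[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sub = [row[h * hd:(h + 1) * hd] for row in x]
        heads.append(self_attention(sub)[0])
    return [sum((heads[h][t] for h in range(num_heads)), []) for t in range(len(x))]

if __name__ == "__main__":
    # Four time steps (e.g. phoneme embeddings), four features each.
    x = [[1.0, 0.0, 0.5, 0.2],
         [0.0, 1.0, 0.1, 0.3],
         [0.5, 0.5, 0.9, 0.0],
         [0.2, 0.1, 0.0, 1.0]]
    _, w = self_attention(x)
    # The first and last positions are linked by a single nonzero weight.
    print(w[0][3] > 0)
    print(len(multi_head_self_attention(x, num_heads=2)))
```

Because every attention weight is strictly positive after the softmax, the path between any two time steps has length one regardless of how far apart they are, whereas an RNN must pass information through every intermediate step.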
Related papers
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) · 0.00
- S-Transformer: Segment-Transformer For Robust Neural Speech Synthesis (2020) · 0.00
- Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions (2017) · 24.07
- Wave-Tacotron: Spectrogram-free End-to-end Text-to-speech Synthesis (2020) · 12.81
- Transfer Learning From Speaker Verification To Multispeaker Text-to-speech Synthesis (2018) · 0.00
- Transformer-Transducer: End-to-end Speech Recognition With Self-attention (2019) · 0.00
- Neural HMMs Are All You Need (for High-quality Attention-free TTS) (2021) · 7.50
- GraphSpeech: Syntax-aware Graph Attention Network For Neural Speech Synthesis (2020) · 7.50