s-Transformer: Segment-Transformer For Robust Neural Speech Synthesis
2020 · Xi Wang, Huaiping Ming, Lei He, et al.
Abstract
Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention-based one, e.g. Transformer, to characterize a speech utterance, has significantly improved speech synthesis. However, handling varying sentence lengths remains very challenging, particularly for long sentences, where a sequence model is limited in its effective context length. We propose a novel segment-Transformer (s-Transformer), which models speech at the segment level, where recurrence is reused via cached memories for both the encoder and decoder. Long-range context can be captured by the extended memory, while the encoder-decoder attention operates on a single segment, which is much easier to handle. In addition, we employ a modified relative positional self-attention to generalize to sequence lengths beyond those seen in the training data. Comparing the proposed s-Transformer with the standard Transformer: on short sentences, both achieve the same MOS sco
Related papers
- Neural Speech Synthesis With Transformer Network (2018)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Synchronous Transformers For End-to-end Speech Recognition (2019)
- T-GSA: Transformer With Gaussian-weighted Self-attention For Speech Enhancement (2019)
- Simple And Effective Multi-sentence TTS With Expressive And Coherent Prosody (2022)
- Simplified Self-attention For Transformer-based End-to-end Speech Recognition (2020)
- SegAug: CTC-aligned Segmented Augmentation For Robust RNN-transducer Based Speech Recognition (2025)