Hierarchical Context-aware Transformers For Non-autoregressive Text To Speech
2021 · Jae-Sung Bae, Tae-Jun Bak, Young-Sun Joo, et al.
Abstract
In this paper, we propose methods for improving the modeling performance of a Transformer-based non-autoregressive text-to-speech (TNA-TTS) model. Although the text encoder and audio decoder handle different types and lengths of data (i.e., text and audio), TNA-TTS models are not designed with these differences in mind. Therefore, to improve the modeling performance of the TNA-TTS model, we propose a hierarchical Transformer structure-based text encoder and audio decoder that are designed to accommodate the characteristics of each module. For the text encoder, we constrain each self-attention layer so the encoder focuses on a text sequence from the local to the global scope. Conversely, the audio decoder constrains its self-attention layers to focus in the reverse direction, i.e., from global to local scope. Additionally, we improve the pitch modeling accuracy of the audio decoder by providing sentence-level and word-level pitch as conditions. Various objective and subjective evalua
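The local-to-global and global-to-local constraints described in the abstract can be illustrated with per-layer attention masks. The sketch below is an assumption for illustration only, not the authors' implementation: it uses a banded mask (position i may attend to position j only when |i - j| is within a window) and a hypothetical schedule in which the window doubles at each layer and the last layer is unrestricted. The names `band_mask` and `layer_windows`, the doubling rule, and the layer count are all invented here; the abstract does not specify how the scope is constrained.

```python
def band_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask; True where attention is allowed.

    Position i may attend to position j only if |i - j| <= window.
    A window of None means unrestricted (global) attention.
    """
    if window is None:
        return [[True] * seq_len for _ in range(seq_len)]
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def layer_windows(num_layers, base_window, direction):
    """Hypothetical schedule: the window doubles per layer and ends in global scope.

    direction="local_to_global" mimics the text encoder described in the abstract;
    direction="global_to_local" mimics the audio decoder.
    """
    windows = [base_window * (2 ** k) for k in range(num_layers - 1)] + [None]
    if direction == "local_to_global":
        return windows
    if direction == "global_to_local":
        return windows[::-1]
    raise ValueError(f"unknown direction: {direction}")


# Assumed settings: 4 layers, base window of 1 position on each side.
encoder_windows = layer_windows(4, 1, "local_to_global")
decoder_windows = layer_windows(4, 1, "global_to_local")

# One mask per encoder layer for a 6-token text sequence.
encoder_masks = [band_mask(6, w) for w in encoder_windows]
```

In a real model, each mask would be passed to the corresponding self-attention layer so that disallowed positions receive zero attention weight; the decoder would use the reversed schedule over mel-spectrogram frames rather than text tokens.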
Related papers
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021) · 8.35
- Context-aware Coherent Speaking Style Prediction With Hierarchical Transformers For Audiobook Speech Synthesis (2023) · 5.24
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020) · 10.07
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022) · 7.50
- Neural Speech Synthesis With Transformer Network (2018) · 19.95
- Hierarchical And Multi-scale Variational Autoencoder For Diverse And Natural Non-autoregressive Text-to-speech (2022) · 3.58
- S-transformer: Segment-transformer For Robust Neural Speech Synthesis (2020) · 0.00
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020) · 0.00