Neural Sequence-to-Sequence Speech Synthesis Using a Hidden Semi-Markov Model Based Structured Attention Mechanism
2021 · Yoshihiko Nankaku, Kenta Sumiya, Takenori Yoshimura, et al.
Abstract
This paper proposes a novel Sequence-to-Sequence (Seq2Seq) model that integrates the structure of Hidden Semi-Markov Models (HSMMs) into its attention mechanism. In speech synthesis, methods based on Seq2Seq models using deep neural networks have been shown to synthesize high-quality speech under appropriate conditions. However, several essential problems remain: such models require large amounts of training data because of the excessive degrees of freedom in the alignment (the mapping function between two sequences), and they handle duration poorly because they lack explicit duration modeling. The proposed method defines a generative model that realizes the simultaneous optimization of alignments and model parameters based on the Variational Auto-Encoder (VAE) framework, and provides monotonic alignments and explicit duration modeling based on the structure of the HSMM. The proposed method can be regarded as an integration of Hidden Markov Model (HMM) based speech synthesis and
Related papers
- Neural HMMs Are All You Need (for High-quality Attention-free TTS) (2021)
- Forward Attention In Sequence-to-sequence Acoustic Modelling For Speech Synthesis (2018)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- Hierarchical Generative Modeling For Controllable Speech Synthesis (2018)
- Singing Voice Synthesis Based On A Musical Note Position-aware Attention Mechanism (2022)
- Robust Sequence-to-sequence Acoustic Modeling With Stepwise Monotonic Attention For Neural TTS (2019)
- On Using 2D Sequence-to-sequence Models For Speech Recognition (2019)
- Supervised Attention In Sequence-to-sequence Models For Speech Recognition (2022)