Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments
2019 · Yusuke Yasuda, Xin Wang, Junichi Yamagishi
Abstract
End-to-end text-to-speech (TTS) synthesis directly converts input text to output acoustic features using a single network. Recent advances in end-to-end TTS are due to a key technique, the attention mechanism, and all successful methods proposed so far have been based on soft attention. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention may still fail to learn and predict an accurate alignment between the input and output. This may be because soft attention mechanisms are too flexible. Therefore, we propose an approach with more explicit but natural constraints suited to speech signals, to make alignment learning and prediction in end-to-end TTS systems more robust. The proposed system, with the constrained alignment scheme borrowed from segment-to-segment neural transduction (SSNT), directly calculates the joint probability of acoustic features and alignment given an inp
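The marginalization over monotonic hard alignments named in the title can be illustrated with a forward recursion. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the alignment starts at the first input position, either stays or advances by one position at each output step, and ends at the last input position. The tables `log_out`, `log_stay`, and `log_shift` and all function names are hypothetical; in a real model these probabilities would come from the network and would generally depend on previous outputs.

```python
import math
from itertools import product


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def marginal_log_likelihood(log_out, log_stay, log_shift):
    """Sum over all monotonic alignments with a forward recursion.

    Assumed (hypothetical) inputs, all indexed [input position i][output step j]:
      log_out[i][j]   : log-probability of output frame j when aligned to i
      log_stay[i][j]  : log-probability of staying at i after step j
      log_shift[i][j] : log-probability of moving from i to i + 1 after step j
    """
    n_in, n_out = len(log_out), len(log_out[0])
    alpha = [[-math.inf] * n_out for _ in range(n_in)]
    alpha[0][0] = log_out[0][0]  # alignment is assumed to start at input 0
    for j in range(1, n_out):
        for i in range(n_in):
            stay = alpha[i][j - 1] + log_stay[i][j - 1]
            shift = -math.inf
            if i > 0:
                shift = alpha[i - 1][j - 1] + log_shift[i - 1][j - 1]
            alpha[i][j] = log_out[i][j] + logsumexp2(stay, shift)
    # alignment is assumed to end at the last input position
    return alpha[n_in - 1][n_out - 1]


def brute_force_log_likelihood(log_out, log_stay, log_shift):
    """Same quantity by explicit enumeration; only feasible for tiny inputs."""
    n_in, n_out = len(log_out), len(log_out[0])
    total = 0.0
    for moves in product((0, 1), repeat=n_out - 1):  # 0 = stay, 1 = shift
        if sum(moves) != n_in - 1:
            continue  # path does not end at the last input position
        i, lp = 0, log_out[0][0]
        for j, move in enumerate(moves, start=1):
            lp += (log_shift if move else log_stay)[i][j - 1]
            i += move
            lp += log_out[i][j]
        total += math.exp(lp)
    return math.log(total)


if __name__ == "__main__":
    # Toy tables: 2 input positions, 3 output frames (made-up numbers).
    out = [[math.log(p) for p in row] for row in [[0.9, 0.5, 0.1], [0.2, 0.6, 0.8]]]
    stay = [[math.log(0.6)] * 3 for _ in range(2)]
    shift = [[math.log(0.4)] * 3 for _ in range(2)]
    print(marginal_log_likelihood(out, stay, shift))
    print(brute_force_log_likelihood(out, stay, shift))
```

The recursion costs time proportional to the number of input positions times the number of output frames, whereas explicit enumeration grows exponentially with the number of output frames. The brute-force function is included only to check the recursion on small cases.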
Related papers
- Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021) 8.09
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024) 0.00
- Robust Sequence-to-sequence Acoustic Modeling With Stepwise Monotonic Attention For Neural TTS (2019) 11.49
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021) 9.03
- MoBoAligner: A Neural Alignment Model For Non-autoregressive TTS With Monotonic Boundary Search (2020) 2.26
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020) 0.00
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer For Zero-shot Speech Synthesis (2025) 0.00
- Forward-backward Decoding For Regularizing End-to-end TTS (2019) 6.77