Robust Sequence-to-sequence Acoustic Modeling With Stepwise Monotonic Attention For Neural TTS
2019 · Mutian He, Yan Deng, Lei He
Abstract
Neural TTS can generate human-like speech with high quality and naturalness, but generalizing to out-of-domain texts remains challenging because of how attention-based sequence-to-sequence acoustic models are designed. Inputs with unseen context trigger various errors, including attention collapse, skipping, and repeating, which limits broader application. In this paper, we propose a novel stepwise monotonic attention method for sequence-to-sequence acoustic modeling that improves robustness on out-of-domain inputs. The method exploits the strictly monotonic nature of TTS by constraining monotonic hard attention so that the alignment between the input and output sequences is not only monotonic but also skips no input. Soft attention can be used to avoid a mismatch between training and inference. Experimental results show that the proposed method achieves significant improvements in robustness.
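To make the "monotonic, no skipping" constraint concrete, here is a minimal sketch of one soft alignment update in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `stay_prob` parameter (the probability of remaining on the same input position at a decoder step), and the handling of the final position are all hypothetical choices made for this example.

```python
def stepwise_monotonic_step(prev_alignment, stay_prob):
    """One decoder step of a soft stepwise monotonic alignment update.

    prev_alignment[j]: probability that the previous step attended to input j.
    stay_prob[j]: ASSUMED probability of staying on input j at this step;
                  with probability 1 - stay_prob[j] attention moves to j + 1.
    Mass that would move past the last input is dropped in this sketch.
    """
    n = len(prev_alignment)
    new_alignment = [0.0] * n
    for j in range(n):
        # Mass that was already on j and stays there.
        stay = prev_alignment[j] * stay_prob[j]
        # Mass that advances by exactly one position, from j - 1 to j.
        move = prev_alignment[j - 1] * (1.0 - stay_prob[j - 1]) if j > 0 else 0.0
        new_alignment[j] = stay + move
    return new_alignment


# Start fully aligned to the first input; use a constant stay probability.
alignment = [1.0, 0.0, 0.0, 0.0]
stay = [0.5, 0.5, 0.5, 0.5]

alignment = stepwise_monotonic_step(alignment, stay)
print(alignment)  # [0.5, 0.5, 0.0, 0.0]

alignment = stepwise_monotonic_step(alignment, stay)
print(alignment)  # [0.25, 0.5, 0.25, 0.0]
```

Each position receives mass only from itself or its immediate left neighbour, so the alignment can never move backwards and can never jump over an input. In a real model the stay probabilities would come from the attention energies at each decoder step rather than being fixed constants.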
Related papers
- Forward Attention In Sequence-to-sequence Acoustic Modelling For Speech Synthesis (2018) · 12.10
- Neural HMMs Are All You Need (for High-quality Attention-free TTS) (2021) · 7.50
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024) · 0.00
- FeatherTTS: Robust And Efficient Attention Based Neural TTS (2020) · 5.84
- Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments (2019) · 0.00
- Regotron: Regularizing The Tacotron2 Architecture Via Monotonic Alignment Loss (2022) · 5.24
- Speaking Style Adaptation In Text-to-speech Synthesis Using Sequence-to-sequence Models With Attention (2018) · 0.00
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) · 8.82