Parallel Tacotron 2: A Non-autoregressive Neural TTS Model With Differentiable Duration Modeling
2021 · Isaac Elias, Heiga Zen, Jonathan Shen, et al.
Abstract
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, so the model can learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
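The Soft Dynamic Time Warping loss named in the abstract can be illustrated with a minimal NumPy sketch of the generic Soft-DTW recursion. This is not the paper's implementation: the L1 frame distance, the `gamma` value, and the function names `soft_min` and `soft_dtw` are assumptions chosen for illustration, and the paper's specific loss configuration is not reproduced here.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = np.max(v)
    if not np.isfinite(m):
        # Every candidate is +inf (unreachable cell), so the result is +inf.
        return np.inf
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))


def soft_dtw(pred, target, gamma=0.1):
    """Soft-DTW discrepancy between two frame sequences.

    pred:   array of shape (n, d), e.g. predicted spectrogram frames
    target: array of shape (m, d), e.g. reference spectrogram frames
    gamma:  smoothing strength (assumed value); smaller is closer to hard DTW
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)

    # Pairwise L1 distance between frames (an assumed choice of cost).
    cost = np.abs(pred[:, None, :] - target[None, :, :]).sum(axis=-1)

    # r[i, j] is the soft-minimum alignment cost of pred[:i] and target[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]
```

Because the soft minimum is smooth, the resulting value is differentiable with respect to the predicted frames, which is what allows a loss of this kind to compare sequences of different lengths without supervised duration labels. The double loop is written for readability, not speed.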
Related papers
- Non-attentive Tacotron: Robust And Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (2020)
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)
- Using IPA-based Tacotron For Data Efficient Cross-lingual Speaker Adaptation And Pronunciation Enhancement (2020)
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017)
- Tacotron: Towards End-to-end Speech Synthesis (2017)
- Regotron: Regularizing The Tacotron2 Architecture Via Monotonic Alignment Loss (2022)