Modeling Prosodic Phrasing With Multi-task Learning In Tacotron-based TTS
2020 · Rui Liu, Berrak Sisman, Feilong Bao, et al.
Abstract
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that the proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
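As a rough illustration of what a multi-task objective of this kind could look like, the sketch below adds a phrase-break classification loss to a Mel-spectrum reconstruction loss. The abstract does not specify the loss functions or their weighting, so the mean squared error, the cross-entropy, the `break_weight` parameter, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math


def mel_loss(pred, target):
    """Mean squared error over all frames and Mel bins (assumed loss choice)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def break_loss(probs, labels):
    """Mean cross-entropy over tokens.

    probs[i][k] is the predicted probability of break class k for token i,
    labels[i] is the index of the reference break class (assumed encoding).
    """
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def multitask_loss(mel_pred, mel_target, break_probs, break_labels, break_weight=1.0):
    """Weighted sum of the two task losses; break_weight is a hypothetical knob."""
    return mel_loss(mel_pred, mel_target) + break_weight * break_loss(break_probs, break_labels)


if __name__ == "__main__":
    # One frame with two Mel bins, one token with two break classes.
    loss = multitask_loss([[0.0, 1.0]], [[0.0, 3.0]], [[0.5, 0.5]], [1], break_weight=2.0)
    print(loss)
```

In a setup like this, both terms would be backpropagated through a shared text encoder, so the phrase-break supervision can shape the representations used for spectrum prediction.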
Related papers
- Towards End-to-end Prosody Transfer For Expressive Speech Synthesis With Tacotron (2018)
- Tacotron: Towards End-to-end Speech Synthesis (2017)
- Fully-hierarchical Fine-grained Prosody Modeling For Interpretable Speech Synthesis (2020)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Non-attentive Tacotron: Robust And Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (2020)
- An Investigation Of Phrase Break Prediction In An End-to-end TTS System (2023)
- ParaTTS: Learning Linguistic And Prosodic Cross-sentence Information In Paragraph-based TTS (2022)
- Expressive TTS Training With Frame And Style Reconstruction Loss (2020)