Using Previous Acoustic Context To Improve Text-to-speech Synthesis
2020 · Pilar Oplustil-Gallegos, Simon King
Abstract
Many speech synthesis datasets, especially those derived from audiobooks, naturally comprise sequences of utterances. Nevertheless, such data are commonly treated as individual, unordered utterances both when training a model and at inference time. This discards important prosodic phenomena above the utterance level. In this paper, we leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio. This is input to the decoder in a Tacotron 2 model. The embedding is also used for a secondary task, providing additional supervision. We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio. Results show that the relation between consecutive utterances is informative: our proposed model significantly improves naturalness over a Tacotron 2 baseline.
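The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mean-pooling encoder, the linear projection, the concatenation onto decoder inputs, the mean-squared-error loss, and every name and dimension below are assumptions chosen only to show how an embedding of the previous utterance audio might condition a decoder and also feed a secondary embedding-prediction task.

```python
import random


def context_embedding(mel_frames, weights):
    """Mean-pool mel frames over time, then apply a linear projection.

    mel_frames: list of frames, each a list of n_mels floats.
    weights: emb_dim rows of n_mels floats (hypothetical learned projection).
    """
    n_mels = len(mel_frames[0])
    pooled = [sum(f[i] for f in mel_frames) / len(mel_frames) for i in range(n_mels)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]


def condition_decoder_inputs(decoder_inputs, embedding):
    """Concatenate the same context embedding onto every decoder input step.

    Concatenation is an assumption; the abstract only says the embedding
    is input to the decoder.
    """
    return [step + embedding for step in decoder_inputs]


def embedding_prediction_loss(predicted, target):
    """Mean squared error between a predicted and a target embedding."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


random.seed(0)
n_mels, emb_dim = 4, 2  # toy sizes, far smaller than a real model
weights = [[random.uniform(-1, 1) for _ in range(n_mels)] for _ in range(emb_dim)]
prev_mel = [[random.random() for _ in range(n_mels)] for _ in range(10)]
curr_mel = [[random.random() for _ in range(n_mels)] for _ in range(7)]

prev_emb = context_embedding(prev_mel, weights)  # embedding of previous utterance audio
curr_emb = context_embedding(curr_mel, weights)  # target for the secondary task

decoder_inputs = [[0.0] * 3 for _ in range(5)]   # placeholder decoder input steps
conditioned = condition_decoder_inputs(decoder_inputs, prev_emb)

# Secondary task: here the previous embedding itself stands in for the
# prediction, a placeholder for whatever predictor a real model would learn.
loss = embedding_prediction_loss(prev_emb, curr_emb)
```

In a trained system the projection would be learned jointly with the synthesis model, and the secondary loss would be added to the main synthesis loss as additional supervision.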
Related papers
- Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts (2022) 7.81
- Semi-supervised Training For Improving Data Efficiency In End-to-end Speech Synthesis (2018) 13.28
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022) 0.00
- Contextual Joint Factor Acoustic Embeddings (2019) 2.26
- Towards End-to-end Prosody Transfer For Expressive Speech Synthesis With Tacotron (2018) 0.00
- Improving Sequence-to-sequence Acoustic Modeling By Adding Text-supervision (2018) 9.92
- Exploiting Deep Sentential Context For Expressive End-to-end Speech Synthesis (2020) 5.84
- Alternate Endings: Improving Prosody For Incremental Neural TTS With Predicted Future Text Input (2021) 6.34