Abstract

We present a neural text-to-speech system for fine-grained prosody transfer from one speaker to another. Conventional approaches for end-to-end prosody transfer typically use either fixed-dimensional or variable-length prosody embedding via a secondary attention to encode the reference signal. However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robust enough to speaker variability, especially in the case of a reference signal coming from an unseen speaker. Therefore, we propose decoupling of the reference signal alignment from the overall system. For this purpose, we pre-compute phoneme-level time stamps and use them to aggregate prosodic features per phoneme, injecting them into a sequence-to-sequence text-to-speech system. We incorporate a variational auto-encoder to further enhance the latent representation of prosody embeddings. We show that our proposed approach is significantly more stable and achieves reliable prosody transplantat

Authors

(none)

Tags

  • Text-to-Speech
  • Speaker Analysis

Stats

  • citations0
  • S2 citations
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyklimkov2019fine

Related papers

Fine-grained Robust Prosody Transfer For Single-speaker Neural Text-to-speech — speech-audio