Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0
2020 · Zack Hodari, Catherine Lai, Simon King
Abstract
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, list
Related papers
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022) 6.34
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021) 5.84
- Speech Resynthesis From Discrete Disentangled Self-supervised Representations (2021) 16.25
- Using Generative Modelling To Produce Varied Intonation For Speech Synthesis (2019) 7.81
- Unsupervised Quantized Prosody Representation For Controllable Speech Synthesis (2022) 4.52
- Improved Prosody From Learned F0 Codebook Representations For VQ-VAE Speech Waveform Reconstruction (2020) 7.50
- Disentangling Prosody Representations With Unsupervised Speech Reconstruction (2022) 0.00
- Learning Utterance-level Representations Through Token-level Acoustic Latents Prediction For Expressive Speech Synthesis (2022) 0.00