Transforming Spectrum And Prosody For Emotional Voice Conversion With Non-parallel Training Data
2020 · Kun Zhou, Berrak Sisman, Haizhou Li
Abstract
Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more appropriate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that our proposed
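The ten-scale F0 decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Mexican hat mother wavelet, scales one octave apart, and a continuous (already interpolated) log-F0 contour normalized to zero mean and unit variance. None of these settings are stated in the abstract, and the names `mexican_hat` and `cwt_decompose` are hypothetical.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, constant factor omitted."""
    return (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_decompose(contour, num_scales=10, base_scale=1.0):
    """Decompose a 1-D contour into `num_scales` wavelet components.

    Assumption: scales are dyadic (base_scale * 2**(i + 1) frames),
    which is an illustrative choice, not a setting taken from the paper.
    Returns an array of shape (num_scales, len(contour)).
    """
    x = np.asarray(contour, dtype=float)
    frames = np.arange(len(x))
    # offsets[b, t] = t - b: distance of every frame t from centre frame b
    offsets = frames[None, :] - frames[:, None]
    components = np.empty((num_scales, len(x)))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        # Scaled, shifted wavelets as rows; the matrix product evaluates
        # the discretized CWT at every centre frame for this scale.
        kernel = mexican_hat(offsets / scale) / np.sqrt(scale)
        components[i] = kernel @ x
    return components


# Synthetic stand-in for a log-F0 contour: slow phrase-level movement
# plus a faster syllable-level ripple (purely illustrative data).
t = np.arange(400)
log_f0 = 0.5 * np.sin(2 * np.pi * t / 200) + 0.1 * np.sin(2 * np.pi * t / 8)
log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()

components = cwt_decompose(log_f0)
print(components.shape)  # (10, 400)
```

In this sketch the small scales respond to fast, syllable-level variation and the large scales to slow, phrase-level variation, which is the hierarchical view of prosody the abstract describes.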
Related papers
- Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion (2020)
- Spectrum And Prosody Conversion For Cross-lingual Voice Conversion With CycleGAN (2020)
- Towards End-to-end F0 Voice Conversion Based On Dual-GAN With Convolutional Wavelet Kernels (2021)
- Non-parallel Emotion Conversion Using A Deep-generative Hybrid Network And An Adversarial Pair Discriminator (2020)
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020)
- Nonparallel Emotional Speech Conversion (2018)
- VAW-GAN For Disentanglement And Recomposition Of Emotional Elements In Speech (2020)
- A Diffeomorphic Flow-based Variational Framework For Multi-speaker Emotion Conversion (2022)