Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset
2020 · Kun Zhou, Berrak Sisman, Rui Liu, et al.
Abstract
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on a discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional styles to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers an
Related papers
- Expressive Voice Conversion: A Joint Framework For Speaker Identity And Emotional Style Transfer (2021)
- Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion (2020)
- VAW-GAN For Disentanglement And Recomposition Of Emotional Elements In Speech (2020)
- Disentanglement Of Emotional Style And Speaker Identity For Expressive Voice Conversion (2021)
- Nonparallel Emotional Speech Conversion (2018)
- Nonparallel Emotional Voice Conversion For Unseen Speaker-emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing (2023)
- An Improved StarGAN For Emotional Voice Conversion: Enhancing Voice Quality And Data Augmentation (2021)
- Limited Data Emotional Voice Conversion Leveraging Text-to-speech: Two-stage Sequence-to-sequence Training (2021)