Limited Data Emotional Voice Conversion Leveraging Text-to-speech: Two-stage Sequence-to-sequence Training
2021 · Kun Zhou, Berrak Sisman, Haizhou Li
Abstract
Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving its linguistic content and speaker identity. In this paper, we propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed EVC framework leverages text-to-speech (TTS), since the two tasks share a common goal: generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, we perform emotion training with a limited amount of emotional speech data to learn how to disentangle emotional style and linguistic information from the speech. The proposed framework performs both spectrum and prosody conversion and achieves significant improvement over state-of-the-art baselines in both objective and subjective evaluations.
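The two-stage idea in the abstract (initialize on a large corpus, then train on a small amount of target data) can be illustrated with a toy sketch. This is not the authors' sequence-to-sequence model: the one-dimensional linear model, the synthetic corpora, the step counts, and every name below (`train`, `mse`, `large_corpus`, `small_corpus`) are assumptions chosen only to show the training schedule.

```python
# Toy illustration of two-stage training (hypothetical, NOT the paper's model).
# Stage 1 fits a linear model on a larger "general" corpus; stage 2 continues
# training from those weights on a small, related "target" corpus.

def mse(params, data):
    """Mean squared error of y = w * x + b on a list of (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Plain full-batch gradient descent on the MSE loss."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

# Assumed synthetic data: the large corpus follows y = 2x + 1.0, the small
# corpus follows the related mapping y = 2x + 1.5 (same slope, shifted bias).
large_corpus = [(i / 10, 2 * (i / 10) + 1.0) for i in range(-10, 11)]
small_corpus = [(x, 2 * x + 1.5) for x in (-1.0, 0.0, 1.0)]

# Stage 1: initialization on the large corpus.
stage1 = train((0.0, 0.0), large_corpus, steps=500)

# Stage 2: a short training run on the small corpus, starting from stage 1.
two_stage = train(stage1, small_corpus, steps=10)

# Comparison: the same short run on the small corpus without stage 1.
scratch = train((0.0, 0.0), small_corpus, steps=10)

loss_two_stage = mse(two_stage, small_corpus)
loss_scratch = mse(scratch, small_corpus)
print(f"two-stage loss: {loss_two_stage:.4f}, from-scratch loss: {loss_scratch:.4f}")
```

In this toy setting the stage-1 weights already capture the shared slope, so the short stage-2 run only has to adjust the bias. That mirrors, in a very loose way, the motivation for initializing on a TTS corpus before training on limited emotional data.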
Related papers
- An Overview & Analysis Of Sequence-to-sequence Emotional Voice Conversion (2022)
- Mixed-EVC: Mixed Emotion Synthesis And Control In Voice Conversion (2022)
- Cross-speaker Emotion Transfer For Low-resource Text-to-speech Using Non-parallel Voice Conversion With Pitch-shift Data Augmentation (2022)
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020)
- An Improved StarGAN For Emotional Voice Conversion: Enhancing Voice Quality And Data Augmentation (2021)
- Nonparallel Emotional Voice Conversion For Unseen Speaker-emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing (2023)
- Towards Realistic Emotional Voice Conversion Using Controllable Emotional Intensity (2024)
- Emotional Voice Conversion Using Multitask Learning With Text-to-speech (2019)