EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion For Non-parallel And In-the-wild Data
2023 · Navin Raj Prabhu, Bunlong Lay, Simon Welker, et al.
Abstract
Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.1
Related papers
- In-the-wild Speech Emotion Conversion Using Disentangled Self-supervised Representations And Neural Vocoder-based Resynthesis (2023) 0.00
- Nonparallel Emotional Speech Conversion (2018) 11.08
- EmoMix: Emotion Mixing Via Diffusion Models For Emotional Speech Synthesis (2023) 0.00
- Textless Speech Emotion Conversion Using Discrete And Decomposed Representations (2021) 10.74
- EmoReg: Directional Latent Vector Modeling For Emotional Intensity Regularization In Diffusion-based Voice Conversion (2024) 2.26
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020) 16.34
- Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion (2020) 11.39
- EmoDiff: Intensity Controllable Emotional Text-to-speech With Soft-label Guidance (2022) 0.00