Emo-StarGAN: A Semi-supervised Any-to-many Non-parallel Emotion-preserving Voice Conversion
2023 · Suhita Ghosh, Arnab Das, Yamini Sinha, et al.
Abstract
Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least the linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and on acoustic features correlated with emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen across diverse datasets, emotions, target speakers, and inter-group conversions, without compromising intelligibility or anonymisation.
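The abstract describes emotion-aware losses computed on emotion embeddings and on emotion-correlated acoustic features, but it does not give their exact form. The following is only a minimal sketch of the general idea, assuming a mean absolute (L1) distance between source and converted utterances and a simple weighted sum. The function names, the weights `w_emb` and `w_feat`, and the choice of distance are all hypothetical and not taken from the paper.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def emotion_aware_loss(src_emb, conv_emb, src_feat, conv_feat,
                       w_emb=1.0, w_feat=1.0):
    """Hypothetical emotion-preservation penalty (illustrative only).

    src_emb / conv_emb:   emotion embeddings of the source and converted speech
    src_feat / conv_feat: emotion-correlated acoustic features of the same pair
    w_emb / w_feat:       assumed weights, not values from the paper
    """
    emb_term = l1_distance(src_emb, conv_emb)     # embedding mismatch
    feat_term = l1_distance(src_feat, conv_feat)  # acoustic-feature mismatch
    return w_emb * emb_term + w_feat * feat_term
```

Under these assumptions the penalty is zero when the converted utterance keeps the source's emotion embedding and acoustic features unchanged, and it grows as either drifts away. In practice such a term would be added to the voice conversion model's other training losses.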
Related papers
- StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings (2023) 2.26
- An Improved StarGAN For Emotional Voice Conversion: Enhancing Voice Quality And Data Augmentation (2021) 7.81
- StarGAN-VC: Non-parallel Many-to-many Voice Conversion With Star Generative Adversarial Networks (2018) 18.09
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework For Natural-sounding Voice Conversion (2021) 13.70
- Expressive Voice Conversion: A Joint Framework For Speaker Identity And Emotional Style Transfer (2021) 9.03
- StarGAN-VC+ASR: StarGAN-based Non-parallel Voice Conversion Regularized By Automatic Speech Recognition (2021) 5.24
- Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion (2020) 11.39
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020) 16.34