PAVITS: Exploring Prosody-aware VITS For End-to-end Emotional Voice Conversion
2024 · Tianhua Qi, Wenming Zheng, Cheng Lu, et al.
Abstract
In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve two major objectives of EVC: high content naturalness and high emotional naturalness, which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we have developed an end-to-end EVC architecture inspired by the high audio quality of VITS. By seamlessly integrating an acoustic converter and vocoder, we effectively address the common issue of mismatch between emotional prosody training and run-time conversion that is prevalent in existing EVC models. To further enhance the emotional naturalness, we introduce an emotion descriptor to model the subtle prosody variations of different speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text based on the provided emotion label. Notably, we introduce a prosody alignment loss to establish a connection between latent prosody features
Related papers
- Mixed-EVC: Mixed Emotion Synthesis And Control In Voice Conversion (2022)
- Towards Realistic Emotional Voice Conversion Using Controllable Emotional Intensity (2024)
- Limited Data Emotional Voice Conversion Leveraging Text-to-speech: Two-stage Sequence-to-sequence Training (2021)
- PMVC: Data Augmentation-based Prosody Modeling For Expressive Voice Conversion (2023)
- An Overview & Analysis Of Sequence-to-sequence Emotional Voice Conversion (2022)
- Period VITS: Variational Inference With Explicit Pitch Modeling For End-to-end Emotional Speech Synthesis (2022)
- VAW-GAN For Disentanglement And Recomposition Of Emotional Elements In Speech (2020)
- EmoReg: Directional Latent Vector Modeling For Emotional Intensity Regularization In Diffusion-based Voice Conversion (2024)