Attention-based Interactive Disentangling Network For Instance-level Emotional Voice Conversion
2023 · Yun Chen, Lingxiao Yang, Qi Chen, et al.
Abstract
Emotional voice conversion aims to manipulate a speech utterance according to a given emotion while preserving its non-emotional components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. Stage I uses inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion from content. In Stage II, we regularize the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics.
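The abstract does not specify the loss used for inter-speech contrastive learning. As an illustration only, the sketch below shows a generic InfoNCE-style contrastive loss, one common way to pull embeddings of the same emotion together and push other emotions apart. The function names, the temperature value, and the toy embeddings are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor embedding.

    The loss is small when the anchor is more similar to the positive
    than to every negative. This is a textbook form, not the paper's loss.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy emotion embeddings (hypothetical values chosen for illustration).
anchor = [1.0, 0.1]
same_emotion = [0.9, 0.2]
other_emotions = [[-1.0, 0.3], [0.0, 1.0]]

# Correct pairing: the positive shares the anchor's emotion.
loss_good = info_nce(anchor, same_emotion, other_emotions)
# Wrong pairing: a different emotion is treated as the positive.
loss_bad = info_nce(anchor, other_emotions[0], [same_emotion, other_emotions[1]])
```

With these toy vectors, the correctly paired case yields a much smaller loss than the mismatched one, which is the behavior a contrastive objective of this kind is meant to reward.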
Related papers
- Towards Realistic Emotional Voice Conversion Using Controllable Emotional Intensity (2024) 5.84
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020) 16.34
- In-the-wild Speech Emotion Conversion Using Disentangled Self-supervised Representations And Neural Vocoder-based Resynthesis (2023) 0.00
- Limited Data Emotional Voice Conversion Leveraging Text-to-speech: Two-stage Sequence-to-sequence Training (2021) 10.35
- Converting Anyone's Emotion: Towards Speaker-independent Emotional Voice Conversion (2020) 11.39
- Nonparallel Emotional Speech Conversion (2018) 11.08
- Expressive-VC: Highly Expressive Voice Conversion With Attention Fusion Of Bottleneck And Perturbation Features (2022) 9.03
- Mixed-EVC: Mixed Emotion Synthesis And Control In Voice Conversion (2022) 4.52