Residual Speaker Representation For One-shot Voice Conversion
2023 Β· Le Xu, Jiangyan Yi, Tao Wang, et al.
Abstract
Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
Authors
(none)
Tags
Stats
Related papers
- Improving Robustness Of One-shot Voice Conversion With Deep Discriminative Speaker Encoder (2021)5.84
- An Overview Of Voice Conversion And Its Challenges: From Statistical Modeling To Deep Learning (2020)18.53
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024)4.52
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)0.00
- Vec-tok-vc+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024)3.58
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022)11.08
- Learning In Your Voice: Non-parallel Voice Conversion Based On Speaker Consistency Loss (2020)0.00
- One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation (2021)8.09