Mediumvc: Any-to-any Voice Conversion Using Synthetic Specific-speaker Speeches As Intermedium Features
2021 Β· Yewei Gu, Zhenyu Zhang, Xiaowei Yi, et al.
Abstract
To realize any-to-any (A2A) voice conversion (VC), most methods are to perform symmetric self-supervised reconstruction tasks (Xi to Xi), which usually results in inefficient performances due to inadequate feature decoupling, especially for unseen speakers. We propose a two-stage reconstruction task (Xi to Yi to Xi) using synthetic specific-speaker speeches as intermedium features, where A2A VC is divided into two stages: any-to-one (A2O) and one-to-Any (O2A). In the A2O stage, we propose a new A2O method: SingleVC, by employing a noval data augment strategy(pitch-shifted and duration-remained, PSDR) to accomplish Xi to Yi. In the O2A stage, MediumVC is proposed based on pre-trained SingleVC to conduct Yi to Xi. Through such asymmetrical reconstruction tasks (Xi to Yi in SingleVC and Yi to Xi in MediumVC), the models are to capture robust disentangled features purposefully. Experiments indicate MediumVC can enhance the similarity of converted speeches while maintaining a high degree of
Authors
(none)
Tags
Stats
Related papers
- S2VC: A Framework For Any-to-any Voice Conversion With Self-supervised Pretrained Representations (2021)12.25
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)0.00
- Assem-vc: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021)11.64
- Fastvc: Fast Voice Conversion With Non-parallel Data (2020)5.24
- Any-to-one Sequence-to-sequence Voice Conversion Using Self-supervised Discrete Speech Representations (2020)0.00
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024)4.52
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021)20.31
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)6.77