Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion
2022 · Sicheng Yang, Methawee Tantrawenith, Haolin Zhuang, et al.
Abstract
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we apply random resampling to the pitch and content encoders, and we use the variational contrastive log-ratio upper bound of mutual information together with gradient reversal layer based adversarial mutual information learning, so that during training each part of the latent space contains only the desired disentangled representation. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, speech representation disentanglement lets us transfer timbre, pitch and rhythm separately in one-shot VC. Our code, pre-trained models and demo are available
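As a rough illustration of the mutual information upper bound named in the abstract, the sketch below computes a sampled contrastive log-ratio estimate: the mean log-likelihood of matched pairs under a variational approximation q(y|x), minus the mean log-likelihood over all pairings. It is a minimal sketch under stated assumptions, not the authors' implementation: the unit-variance Gaussian form of q, the scalar representations, and the names `gaussian_log_prob`, `vclub_estimate` and `predict_mean` are all hypothetical. The learned variational network and the gradient reversal layer are not shown.

```python
import math


def gaussian_log_prob(y, mean, var=1.0):
    """Log-density of a 1-D Gaussian N(y; mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def vclub_estimate(xs, ys, predict_mean, var=1.0):
    """Sampled contrastive log-ratio upper-bound estimate of I(x; y).

    xs, ys       : paired scalar samples (x_i, y_i)
    predict_mean : hypothetical stand-in for the variational network;
                   maps x to the mean of q(y | x)
    """
    n = len(xs)
    # Matched pairs (x_i, y_i): samples from the joint distribution.
    positive = sum(
        gaussian_log_prob(y, predict_mean(x), var) for x, y in zip(xs, ys)
    ) / n
    # All pairings (x_i, y_j): samples from the product of marginals.
    negative = sum(
        gaussian_log_prob(y, predict_mean(x), var) for x in xs for y in ys
    ) / (n * n)
    return positive - negative


# Strongly dependent toy data: y equals x.
dependent = vclub_estimate([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], lambda x: x)
# y is constant, so it carries no information about x.
independent = vclub_estimate([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], lambda x: x)
```

On this toy data the dependent pair gives an estimate of about 1.25, while the constant pair gives 0. A training objective would push such an estimate down between representations that should not share information.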
Related papers
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021) · 20.31
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- MAIN-VC: Lightweight Speech Representation Disentanglement For One-shot Voice Conversion (2024) · 3.58
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) · 6.34
- Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval (2024) · 5.84
- One-shot Voice Conversion By Separating Speaker And Content Representations With Instance Normalization (2019) · 0.00
- Disentangled Speech Representation Learning For One-shot Cross-lingual Voice Conversion Using β-VAE (2022) · 7.50
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) · 0.00