Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion
2022 · Jiachen Lian, Chunlei Zhang, Dong Yu
Abstract
Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between a global speaker representation and a time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representation noise-invariant. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation.
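The conversion step described in the abstract, pairing one global speaker embedding with a sequence of per-frame content embeddings before decoding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the embedding sizes, and the identity "decoder" are all hypothetical placeholders, and the paper's encoders and decoder are neural networks that are not reproduced here. The sketch only shows the data flow in which content comes from the source utterance and the speaker vector comes from an arbitrary target.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def build_decoder_inputs(content: Sequence[Vector], speaker: Vector) -> List[Vector]:
    """Repeat the single (global) speaker vector at every time step and
    concatenate it with that step's (time-varying) content vector."""
    return [list(frame) + list(speaker) for frame in content]


def convert(
    source_content: Sequence[Vector],
    target_speaker: Vector,
    decoder: Callable[[Vector], Vector],
) -> List[Vector]:
    """Decode source content frames conditioned on a target speaker vector.

    `decoder` is a hypothetical per-frame stand-in for a trained VAE decoder.
    """
    return [decoder(x) for x in build_decoder_inputs(source_content, target_speaker)]


# Toy values (assumptions for illustration only).
source_content = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, content dim 2
target_speaker = [9.0, 8.0]                             # one vector, speaker dim 2

# Identity decoder, used only to make the data flow visible.
converted = convert(source_content, target_speaker, decoder=lambda x: x)
```

With the identity decoder, each of the three output frames keeps its own content values and carries the same target speaker values, which is the swap that zero-shot conversion relies on once the two representations are disentangled.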
Related papers
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) 6.34
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) 0.00
- Improving Zero-shot Voice Style Transfer Via Disentangled Representation Learning (2021) 0.00
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021) 7.16
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022) 11.08
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021) 20.31
- Disentangled Speech Representation Learning For One-shot Cross-lingual Voice Conversion Using \(\beta\)-VAE (2022) 7.50
- QR-VC: Leveraging Quantization Residuals For Linear Disentanglement In Zero-shot Voice Conversion (2024) 0.00