Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion
2021 Β· Jie Wang, Jingbei Li, Xintao Zhao, et al.
Abstract
Factorizing speech as disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC only factorize speech as speaker and content, lacking controllability on other prosody-related factors. State-of-the-art speech representation learning methods for more speechfactors are using primary disentangle algorithms such as random resampling and ad-hoc bottleneck layer size adjustment,which however is hard to ensure robust speech representationdisentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled by an adversarial Mask-And-Predict (MAP)network inspired by BERT. The adversarial network is used tominimize the correlations
Authors
(none)
Tags
Stats
Related papers
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022)11.08
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)10.97
- Beyond Voice Identity Conversion: Manipulating Voice Attributes By Adversarial Learning Of Structured Disentangled Representations (2021)0.00
- Discrete Unit Based Masking For Improving Disentanglement In Voice Conversion (2024)0.00
- Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval (2024)5.84
- Investigation Of Using Disentangled And Interpretable Representations For One-shot Cross-lingual Voice Conversion (2018)6.77
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)0.00
- Disentangled Speech Representation Learning Based On Factorized Hierarchical Variational Autoencoder With Self-supervised Objective (2022)7.81