EAD-VC: Enhancing Speech Auto-disentanglement For Voice Conversion With IFUB Estimator And Joint Text-guided Consistent Learning
2024 · Ziqi Liang, Jianzong Wang, Xulong Zhang, et al.
Abstract
Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentanglement, so pitch and rhythm may still be mixed together. The resulting risk of information overlap during disentangling reduces speech naturalness. To overcome these limitations, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design; it uses Mutual Information (MI) with a designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage. Experiments show that our model can achieve a better performan
Related papers
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021) · 20.31
- Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval (2024) · 5.84
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022) · 11.08
- Automatic Speech Disentanglement For Voice Conversion Using Rank Module And Speech Augmentation (2023) · 4.52
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion (2021) · 9.92
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021) · 7.81
- Unsupervised End-to-end Learning Of Discrete Linguistic Units For Voice Conversion (2019) · 9.03