Training Robust Zero-shot Voice Conversion Models With Self-supervised Features
2021 · Trung Dang, Dung Tran, Peter Chin, et al.
Abstract
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, and these units can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c) adding a speaker classification loss can help to learn a better speaker embedding. Our model trained on Li
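The length mismatch that the abstract mentions can be illustrated with a minimal sketch. This is not the authors' decoder, which is a learned component whose details are not given here; it only shows the underlying idea of mapping a feature sequence of one length onto a different target length, using fixed linear interpolation as a stand-in. The function name `resample_length` and the toy frame counts are hypothetical.

```python
from typing import List


def resample_length(frames: List[List[float]], target_len: int) -> List[List[float]]:
    """Linearly interpolate a (T_in x D) feature sequence to target_len frames.

    Assumption: fixed linear interpolation with aligned endpoints stands in
    for the learned length resampling described in the abstract.
    """
    if not frames:
        raise ValueError("frames must contain at least one frame")
    if target_len <= 0:
        raise ValueError("target_len must be positive")

    src_len = len(frames)
    # Degenerate cases: nothing to interpolate between, so repeat the first frame.
    if src_len == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]

    # Map output index t to a fractional position in the source sequence.
    scale = (src_len - 1) / (target_len - 1)
    out = []
    for t in range(target_len):
        pos = t * scale
        lo = min(int(pos), src_len - 2)  # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo                  # weight of the right neighbour
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


# Toy example: 3 one-dimensional "linguistic" frames stretched to 5 "vocoder" frames.
features = [[0.0], [2.0], [4.0]]
upsampled = resample_length(features, 5)
downsampled = resample_length(features, 2)
```

In a real system the source frames would come from a self-supervised feature extractor and the target length from the vocoder's frame rate; a learned resampler can place that step inside the model rather than relying on a fixed rule like the one above.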
Related papers
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)
- ConVoice: Real-time Zero-shot Voice Style Transfer With Convolutional Network (2020)
- End-to-end Zero-shot Voice Conversion With Location-variable Convolutions (2022)
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024)
- SEF-VC: Speaker Embedding Free Zero-shot Voice Conversion With Cross Attention (2023)
- SelfVC: Voice Conversion With Iterative Refinement Using Self Transformations (2023)