Pureformer-VC: Non-parallel One-shot Voice Conversion With Pure Transformer Blocks And Triplet Discriminative Training
2024 · Wenhan Yao, Zedong Xing, Xiarun Chen, et al.
Abstract
One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of a target speaker, given only one speech sample from that speaker. Existing style-transfer-based VC methods rely on speech representation disentanglement and struggle both to encode each speech component accurately and independently and to recompose the components into converted speech. To tackle this, we propose Pureformer-VC, which uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, styleformer blocks integrate speaker characteristics into the generated speech. The model uses a generative VAE loss for encoding components and a triplet loss for unsupervised discriminative training. We apply the styleformer method to Zipformer's shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in
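The triplet loss named in the abstract can be illustrated with a minimal sketch. The abstract does not give the distance metric, the margin, or how triplets are selected, so the Euclidean distance, the `margin=1.0` default, and the toy two-dimensional "speaker embeddings" below are all assumptions chosen for illustration, not the authors' settings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (hypothetical): anchor and positive stand for the same
# speaker, negative for a different speaker.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from anchor
negative = [3.0, 4.0]   # distance 5 from anchor

easy = triplet_loss(anchor, positive, negative)  # already well separated
hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated triplet
print(easy, hard)
```

In the well-separated case the loss vanishes, while the swapped triplet is penalised, which is the behaviour that pushes embeddings of the same speaker together and those of different speakers apart.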
Related papers
- One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation (2021) 8.09
- AUTOVC: Zero-shot Voice Style Transfer With Only Autoencoder Loss (2019) 0.00
- AVQVC: One-shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) 12.40
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) 6.34
- Fastvc: Fast Voice Conversion With Non-parallel Data (2020) 5.24
- Stablevc: Style Controllable Zero-shot Voice Conversion With Conditional Flow Matching (2024) 7.81
- ZSVC: Zero-shot Style Voice Conversion With Disentangled Latent Diffusion Models And Adversarial Training (2025) 0.00
- Vec-tok-vc+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024) 3.58