Vec-tok-vc+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy
2024 · Linhan Ma, Xinfa Zhu, Yuanjun Lv, et al.
Abstract
Zero-shot voice conversion (VC) aims to transform source speech into an arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process and training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance semantic content extraction with a two-layer clustering process. We also employ teacher-guided refinement to simulate the conversion process and eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function that constrains the layer-wise output of the model from coarse to fine to improve speaker similarity and content accuracy. Objective and subjective eva
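The abstract names a two-layer, residual-enhanced K-Means clustering step but does not give its details. As a rough illustration only, the sketch below shows generic two-layer residual K-Means quantization: a first codebook quantizes each frame, and a second codebook quantizes what the first one missed. The function names (`kmeans`, `residual_kmeans`), the codebook sizes, and the random stand-in features are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def residual_kmeans(x, k1, k2):
    """Two-layer residual quantization (illustrative, hypothetical helper)."""
    c1, a1 = kmeans(x, k1, seed=0)
    coarse = c1[a1]               # layer 1: coarse quantized frames
    residual = x - coarse         # information lost by layer 1
    c2, a2 = kmeans(residual, k2, seed=1)
    refined = coarse + c2[a2]     # layer 2 adds back a quantized residual
    return coarse, refined

# Random vectors stand in for frame-level speech features (an assumption).
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 16))
coarse, refined = residual_kmeans(frames, k1=8, k2=8)

err_coarse = np.mean((frames - coarse) ** 2)
err_refined = np.mean((frames - refined) ** 2)
print(f"layer-1 error: {err_coarse:.4f}, layer-1+2 error: {err_refined:.4f}")
```

In this sketch the second layer can only lower the reconstruction error, since each residual centroid is the mean of its cluster. That is the generic reason a residual layer recovers content lost by a single codebook; how Vec-Tok-VC+ uses the two layers is not stated in the text above.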
Related papers
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- Takin-vc: Expressive Zero-shot Voice Conversion Via Adaptive Hybrid Content Encoding And Enhanced Timbre Modeling (2024) · 0.00
- Maskvct: Masked Voice Codec Transformer For Zero-shot Voice Conversion With Increased Controllability Via Multiple Guidances (2025) · 0.00
- Voicy: Zero-shot Non-parallel Voice Conversion In Noisy Reverberant Environments (2021) · 5.24
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) · 6.34
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021) · 7.16
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) · 0.00
- Voiceprompter: Robust Zero-shot Voice Conversion With Voice Prompt And Conditional Flow Matching (2025) · 3.58