MaskVCT: Masked Voice Codec Transformer For Zero-shot Voice Conversion With Increased Controllability Via Multiple Guidances
2025 · Junhyeok Lee, Helin Wang, Yaohan Guan, et al.
Abstract
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
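The abstract does not give the formula used to combine the multiple classifier-free guidances, so the following is only a minimal sketch of one common generic form: the unconditional logits plus a weighted sum of per-condition offsets. The function name `combine_guidances`, the condition names (`speaker`, `linguistic`, `pitch`), and the weight values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def combine_guidances(
    uncond: List[float],
    cond: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Generic multi-condition classifier-free guidance on one logit vector.

    Computes uncond + sum_k weights[k] * (cond[k] - uncond).
    This is an illustrative assumption, not the paper's stated method.
    """
    out = list(uncond)
    for name, logits in cond.items():
        if len(logits) != len(uncond):
            raise ValueError(f"length mismatch for condition {name!r}")
        w = weights.get(name, 0.0)  # a missing weight means no guidance
        for i, (c, u) in enumerate(zip(logits, uncond)):
            out[i] += w * (c - u)
    return out


# Toy logits over a three-entry codebook (made-up numbers).
uncond = [0.0, 1.0, 2.0]
cond = {
    "speaker": [1.0, 1.0, 2.0],
    "linguistic": [0.0, 3.0, 2.0],
    "pitch": [0.0, 1.0, 1.0],
}
# Hypothetical weights: strong speaker guidance, no pitch guidance.
weights = {"speaker": 2.0, "linguistic": 1.0, "pitch": 0.0}

guided = combine_guidances(uncond, cond, weights)
print(guided)
```

In this sketch, raising one weight pushes the output toward that condition's prediction, while a weight of zero leaves the corresponding condition without any influence on the result. That is the kind of per-factor trade-off the abstract describes.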
Related papers
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024) 3.58
- CoDiff-VC: A Codec-assisted Diffusion Model For Zero-shot Voice Conversion (2024) 0.00
- ControlVC: Zero-shot Voice Conversion With Time-varying Controls On Pitch And Speed (2022) 6.77
- Discrete Unit Based Masking For Improving Disentanglement In Voice Conversion (2024) 0.00
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) 6.34
- Diff-HierVC: Diffusion-based Hierarchical Voice Conversion With Robust Pitch Generation And Masked Prior For Zero-shot Speaker Adaptation (2023) 0.00
- Takin-VC: Expressive Zero-shot Voice Conversion Via Adaptive Hybrid Content Encoding And Enhanced Timbre Modeling (2024) 0.00
- Voicy: Zero-shot Non-parallel Voice Conversion In Noisy Reverberant Environments (2021) 5.24