VoicePrompter: Robust Zero-shot Voice Conversion With Voice Prompt And Conditional Flow Matching
2025 · Ha-Yeong Choi, Jaehan Park
Abstract
Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by a mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. Applying mixup to latent representations in this way improves speaker similarity and naturalness in zero-shot VC. Experimental results demonstrate that VoicePrompt
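The abstract names a conditional flow matching (CFM) decoder but gives no training details. As a rough illustration only, the sketch below shows the generic optimal-transport CFM regression target for a single feature vector. It is not the paper's decoder: the function names, the `sigma_min` value, and the use of plain Python lists are all assumptions made for this example, and conditioning on factorized features or voice prompts is omitted.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the paper


def cfm_training_pair(x1, t, rng, sigma_min=SIGMA_MIN):
    """Build (x_t, target_velocity) for one data vector x1 at time t in [0, 1].

    Uses the straight-line (optimal-transport) probability path:
        x_t    = (1 - (1 - sigma_min) * t) * x0 + t * x1
        target = x1 - (1 - sigma_min) * x0
    where x0 is standard Gaussian noise.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample, same length as x1
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target


def cfm_loss(predicted, target):
    """Mean squared error between a predicted velocity and the CFM target."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]  # hypothetical stand-in for one frame of acoustic features
    x_t, target = cfm_training_pair(x1, t=0.3, rng=rng)
    # A real model would predict the velocity from (x_t, t, conditions);
    # a zero prediction is used here only to exercise the loss.
    print(cfm_loss([0.0] * len(x1), target))
```

In a full system, a network would receive `x_t`, `t`, and the conditioning inputs, and would be trained to minimize `cfm_loss` against `target`. At inference, the learned velocity field would be integrated from noise toward data with an ODE solver.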
Related papers
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024) · 3.58
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) · 6.34
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) · 4.52
- Disentangling The Prosody And Semantic Information With Pre-trained Model For In-context Learning Based Zero-shot Voice Conversion (2024) · 4.52
- Enhancing Expressive Voice Conversion With Discrete Pitch-conditioned Flow Matching Model (2025) · 5.84
- Voicy: Zero-shot Non-parallel Voice Conversion In Noisy Reverberant Environments (2021) · 5.24
- Diff-HierVC: Diffusion-based Hierarchical Voice Conversion With Robust Pitch Generation And Masked Prior For Zero-shot Speaker Adaptation (2023) · 0.00
- MaskVCT: Masked Voice Codec Transformer For Zero-shot Voice Conversion With Increased Controllability Via Multiple Guidances (2025) · 0.00