Diff-HierVC: Diffusion-based Hierarchical Voice Conversion With Robust Pitch Generation And Masked Prior For Zero-shot Speaker Adaptation
2023 · Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
Abstract
Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still suffer from inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and an EER of 3.29% in zero-shot VC scenarios.
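The abstract reports results as a character error rate (CER) and an equal error rate (EER). The sketch below shows how these two metrics are commonly defined; it is not the authors' evaluation code. The function names `cer` and `eer` and the toy inputs are assumptions made for illustration. In VC evaluation the CER is typically measured between a reference transcript and an ASR transcript of the converted speech, and the EER from speaker-verification scores, but the exact setup used in the paper is not stated on this page.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1] / max(len(reference), 1)


def eer(genuine: list[float], impostor: list[float]) -> float:
    """Equal error rate: the point where false-accept and false-reject rates are closest.

    `genuine` holds similarity scores for same-speaker trials and
    `impostor` holds scores for different-speaker trials (hypothetical inputs).
    """
    best_gap, best_rate = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
    print(eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

Lower is better for both metrics: a low CER indicates that linguistic content survives conversion, and the EER summarizes verification error at the threshold where false accepts and false rejects balance.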
Related papers
- CoDiff-VC: A Codec-assisted Diffusion Model For Zero-shot Voice Conversion (2024) 0.00
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) 0.00
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) 6.34
- MaskVCT: Masked Voice Codec Transformer For Zero-shot Voice Conversion With Increased Controllability Via Multiple Guidances (2025) 0.00
- ZSVC: Zero-shot Style Voice Conversion With Disentangled Latent Diffusion Models And Adversarial Training (2025) 0.00
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) 4.52
- DDDM-VC: Decoupled Denoising Diffusion Models With Disentangled Representation And Prior Mixup For Verified Robust Voice Conversion (2023) 11.29
- Converting Anyone's Voice: End-to-end Expressive Voice Conversion With A Conditional Diffusion Model (2024) 5.24