DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach With Diffusion Probabilistic Model
2023 · Helin Wang, Thomas Thebaud, Jesus Villalba, et al.
Abstract
We present a novel typical-to-atypical voice conversion approach (DuTa-VC), which (i) can be trained with nonparallel data, (ii) is the first to introduce a diffusion probabilistic model to this task, (iii) preserves the target speaker identity, and (iv) is aware of the phoneme duration of the target speaker. DuTa-VC consists of three parts: an encoder that transforms the source mel-spectrogram into a duration-modified, speaker-independent mel-spectrogram; a decoder that performs reverse diffusion to generate the target mel-spectrogram; and a vocoder that reconstructs the waveform. Objective evaluations on the UASpeech dataset show that DuTa-VC captures the severity characteristics of dysarthric speech, preserves speaker identity, and significantly improves dysarthric speech recognition when used for data augmentation. Subjective evaluations by two expert speech pathologists confirm that DuTa-VC preserves the severity and type of dysarthria of the target speakers in the synthesized speech.
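The three-part structure described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `modify_durations`, `convert`, `denoise_step`, and `vocoder` are hypothetical names, the nearest-neighbour frame resampling merely stands in for the learned encoder and duration model, and the denoising step and vocoder are supplied by the caller rather than being trained networks.

```python
import numpy as np


def modify_durations(mel, src_durations, tgt_durations):
    """Stretch or compress each phoneme segment of `mel` (frames x mel_bins)
    from its source frame count to the target frame count.
    Hypothetical stand-in for the duration-aware encoder output."""
    segments, start = [], 0
    for src_len, tgt_len in zip(src_durations, tgt_durations):
        segment = mel[start:start + src_len]
        start += src_len
        if src_len == 0 or tgt_len == 0:
            continue  # nothing to copy for an empty segment
        # Nearest-neighbour resampling: map each target frame to a source frame.
        idx = np.arange(tgt_len) * src_len // tgt_len
        segments.append(segment[idx])
    if not segments:
        return mel[:0]
    return np.concatenate(segments, axis=0)


def convert(source_mel, src_durations, tgt_durations,
            denoise_step, vocoder, n_steps=10, seed=0):
    """Encoder stand-in -> iterative reverse process -> vocoder.
    `denoise_step(x, prior, t)` and `vocoder(mel)` are assumed callables."""
    rng = np.random.default_rng(seed)
    prior = modify_durations(source_mel, src_durations, tgt_durations)
    x = prior + rng.standard_normal(prior.shape)  # start from the noisy prior
    for t in reversed(range(n_steps)):
        x = denoise_step(x, prior, t)  # one reverse step toward the target mel
    return vocoder(x)


# Toy usage: 5 source frames, 2 phonemes; the first phoneme is stretched 2 -> 4 frames.
mel = np.arange(10, dtype=float).reshape(5, 2)
stretched = modify_durations(mel, [2, 3], [4, 3])
print(stretched.shape)  # (7, 2)

# Placeholder step (pulls x halfway to the prior) and placeholder "vocoder".
waveform = convert(mel, [2, 3], [4, 3],
                   denoise_step=lambda x, prior, t: 0.5 * (x + prior),
                   vocoder=lambda m: m.sum(axis=1))
```

In a real system the placeholder step would be a trained score network conditioned on the target speaker, and the vocoder would be a neural vocoder operating on mel-spectrograms; the sketch only shows how the three stages connect.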
Related papers
- DDDM-VC: Decoupled Denoising Diffusion Models With Disentangled Representation And Prior Mixup For Verified Robust Voice Conversion (2023)
- Converting Anyone's Voice: End-to-end Expressive Voice Conversion With A Conditional Diffusion Model (2024)
- Learning Explicit Prosody Models And Deep Speaker Embeddings For Atypical Voice Conversion (2020)
- Diff-HierVC: Diffusion-based Hierarchical Voice Conversion With Robust Pitch Generation And Masked Prior For Zero-shot Speaker Adaptation (2023)
- PMVC: Data Augmentation-based Prosody Modeling For Expressive Voice Conversion (2023)
- ZSVC: Zero-shot Style Voice Conversion With Disentangled Latent Diffusion Models And Adversarial Training (2025)
- FastVoiceGrad: One-step Diffusion-based Voice Conversion With Adversarial Conditional Diffusion Distillation (2024)
- Highly Controllable Diffusion-based Any-to-any Voice Conversion Model With Frame-level Prosody Feature (2023)