WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion
2025 · Dong Liu, Juan Liu, Wei Ju, et al.
Abstract
Whispered speech lacks vocal-fold excitation, making intelligible conversion challenging. We propose WhisperVC, a three-stage framework for low-resource whisper-to-normal (W2N) conversion that decouples cross-domain alignment from speech generation. Stage 1 uses limited paired whisper-normal data with a content encoder and a Conformer-based variational autoencoder (VAE) with soft-DTW alignment to learn domain-invariant semantic representations. Stage 2, trained only on normal speech, employs a Length-Channel Aligner and a two-stage speaker-conditioned mel generator for timbre and prosody modeling. Stage 3 fine-tunes a HiFi-GAN vocoder for waveform synthesis. Experimental results on AISHELL6-Whisper show competitive quality (DNSMOS 3.07, UTMOS 2.83, CER 16.93%) and a WavLM speaker similarity of 0.95. The framework also supports privacy-preserving and non-vocal communication, and can serve as a rehabilitation tool for post-surgical vocal-fold patients. Samples are available online.
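To make the soft-DTW alignment mentioned for Stage 1 more concrete, here is a minimal NumPy sketch of the generic soft-DTW cost between two feature sequences of different lengths. It is not the authors' implementation: the function names `soft_min` and `soft_dtw`, the squared Euclidean frame distance, and the smoothing parameter `gamma` are all assumptions chosen for illustration, and the abstract does not say how the cost is applied to the VAE latents.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    finite = values[np.isfinite(values)]  # infinite entries contribute exp(-inf) = 0
    if finite.size == 0:
        return np.inf
    z = -finite / gamma
    z_max = z.max()
    return -gamma * (z_max + np.log(np.exp(z - z_max).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between x of shape (n, d) and y of shape (m, d).

    Assumption: squared Euclidean distance between frames.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Pairwise frame distances, shape (n, m).
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated cost table with an extra border row and column.
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            r[i, j] = dist[i - 1, j - 1] + soft_min(prev, gamma)
    return r[n, m]


# Hypothetical toy sequences: `stretched` is a time-stretched copy of `short`.
short = np.array([[0.0], [1.0], [2.0]])
stretched = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
cost_aligned = soft_dtw(short, stretched, gamma=0.01)
cost_shifted = soft_dtw(short, stretched + 5.0, gamma=0.01)
```

With a small `gamma` the smoothed minimum approaches the hard minimum, so `cost_aligned` stays close to zero despite the length mismatch while `cost_shifted` is much larger. Because the soft minimum is differentiable, a cost of this kind can serve as a training loss between sequences of unequal length, which is the usual motivation for preferring it over classic DTW.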
Related papers
- Vocoder-free Non-parallel Conversion Of Whispered Speech With Masked Cycle-consistent Generative Adversarial Networks (2023) 0.00
- Attention-guided Generative Adversarial Network For Whisper To Normal Speech Conversion (2021) 5.84
- Generative Models For Improved Naturalness, Intelligibility, And Voicing Of Whispered Speech (2022) 6.34
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) 4.52
- Vec2wav 2.0: Advancing Voice Conversion Via Discrete Token Vocoders (2024) 0.00
- Whisper In Focus: Enhancing Stuttered Speech Classification With Encoder Layer Optimization (2023) 0.00
- Whispered-to-voiced Alaryngeal Speech Conversion With Generative Adversarial Networks (2018) 9.41
- End-to-end Whisper To Natural Speech Conversion Using Modified Transformer Network (2020) 0.00