Bootstrapping Non-parallel Voice Conversion From Speaker-adaptive Text-to-speech
2019 · Hieu-Thi Luong, Junichi Yamagishi
Abstract
Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective: generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we propose a methodology to bootstrap a VC system from a pretrained speaker-adaptive TTS model and unify the techniques as well as the interpretations of these two tasks. Moreover, by offloading the heavy data demand to the training stage of the TTS model, our VC system can be built using a small amount of target speaker speech data. It also opens up the possibility of using speech in an unseen foreign language to build the system. Our subjective evaluations show that the proposed framework not only achieves competitive performance in the standard intra-language scenario but can also adapt and convert using speech utterances in an unseen language.
Related papers
- Transfer Learning From Speech Synthesis To Voice Conversion With Non-parallel Training Data (2020)
- Voice Filter: Few-shot Text-to-speech Speaker Adaptation Using Voice Conversion As A Post-processing Module (2022)
- Emotional Voice Conversion Using Multitask Learning With Text-to-speech (2019)
- Cross-lingual Knowledge Distillation Via Flow-based Voice Conversion For Robust Polyglot Text-to-speech (2023)
- Voice Transformer Network: Sequence-to-sequence Voice Conversion Using Transformer With Text-to-speech Pretraining (2019)
- Cross-speaker Emotion Transfer For Low-resource Text-to-speech Using Non-parallel Voice Conversion With Pitch-shift Data Augmentation (2022)
- Transfer The Linguistic Representations From TTS To Accent Conversion With Non-parallel Data (2024)
- TTS-guided Training For Accent Conversion Without Parallel Data (2022)