Transfer Learning From Speech Synthesis To Voice Conversion With Non-parallel Training Data
2020 · Mingyang Zhang, Yi Zhou, Li Zhao, et al.
Abstract
This paper presents a novel framework, called TTS-VC transfer learning, for building a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system. We first develop a multi-speaker speech synthesis system with a sequence-to-sequence encoder-decoder architecture, where the encoder extracts robust linguistic representations of text, and the decoder, conditioned on a target speaker embedding, takes the context vectors and the attention recurrent network cell output to generate target acoustic features. We take advantage of the fact that the TTS system maps input text to speaker-independent context vectors, and reuse this mapping to supervise the training of the latent representations of an encoder-decoder voice conversion system. In the voice conversion system, the encoder takes speech instead of text as input, while the decoder is functionally similar to the TTS decoder. As we condition the decoder on a speaker embedding, the system can be trained on non-parallel data for an
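The supervision idea in the abstract can be illustrated with a small sketch. The abstract does not state the loss function or how speech frames are aligned with the TTS context vectors, so the code below assumes a plain mean squared error over sequences that are already aligned and of equal length. The function names `context_matching_loss` and `condition_on_speaker` are hypothetical and do not come from the paper.

```python
from typing import List

Frame = List[float]


def context_matching_loss(vc_latents: List[Frame], tts_contexts: List[Frame]) -> float:
    """Mean squared error between VC encoder latents and TTS context vectors.

    Assumption: both sequences are already aligned frame by frame. The
    abstract does not say how alignment or the distance measure is chosen.
    """
    if len(vc_latents) != len(tts_contexts):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for latent, context in zip(vc_latents, tts_contexts):
        if len(latent) != len(context):
            raise ValueError("frame dimensions must match")
        for a, b in zip(latent, context):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


def condition_on_speaker(latents: List[Frame], speaker_embedding: Frame) -> List[Frame]:
    """Append the same speaker embedding to every frame before decoding.

    Assumption: conditioning is done by concatenation, one common choice.
    """
    return [frame + speaker_embedding for frame in latents]


# Toy example: two frames of two-dimensional latents.
vc_latents = [[1.0, 2.0], [3.0, 4.0]]
tts_contexts = [[1.0, 2.0], [3.0, 6.0]]
loss = context_matching_loss(vc_latents, tts_contexts)
decoder_inputs = condition_on_speaker(vc_latents, [0.5])
```

Because the decoder input pairs a speaker-independent latent with a separate speaker embedding, swapping the embedding changes the target voice without requiring paired utterances from source and target speakers.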
Related papers
- Bootstrapping Non-parallel Voice Conversion From Speaker-adaptive Text-to-speech (2019)
- Transfer The Linguistic Representations From TTS To Accent Conversion With Non-parallel Data (2024)
- Transfer Learning From Monolingual ASR To Transcription-free Cross-lingual Voice Conversion (2020)
- Voice Transformer Network: Sequence-to-sequence Voice Conversion Using Transformer With Text-to-speech Pretraining (2019)
- Emotional Voice Conversion Using Multitask Learning With Text-to-speech (2019)
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)
- FastVC: Fast Voice Conversion With Non-parallel Data (2020)
- Voice Conversion From Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks (2017)