Building Bilingual And Code-switched Voice Conversion With Limited Training Data Using Embedding Consistency Loss
2021 · Yaogen Yang, Haozhe Zhang, Xiaoyi Qin, et al.
Abstract
Building cross-lingual voice conversion (VC) systems for multiple speakers and multiple languages has long been a challenging task. This paper describes a parallel non-autoregressive network that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language. We achieve cross-lingual VC between multi-speaker Mandarin speech and multi-speaker English speech by applying bilingual bottleneck features. To boost voice cloning performance, we use an adversarial speaker classifier with a gradient reversal layer to remove the source speaker's information from the output of the encoder. Furthermore, to improve speaker similarity between the reference speech and the converted speech, we adopt an embedding consistency loss between the synthesized speech and its natural reference speech in our network. Experimental results show that our proposed method can achieve high quality converted speech with mean
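The two training signals named in the abstract, the gradient reversal layer and the embedding consistency loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which distance the consistency loss uses or how the speaker embeddings are extracted, so cosine distance is assumed here, and the names `GradientReversal`, `embedding_consistency_loss`, and `lam` are hypothetical. Plain Python lists stand in for tensors so the example runs without a deep learning framework.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, 1e-12)


def embedding_consistency_loss(ref_embedding, converted_embedding):
    """Assumed form of the loss: 1 - cosine similarity.

    The value is 0 when the speaker embedding of the converted speech
    points in the same direction as that of the natural reference speech.
    """
    return 1.0 - cosine_similarity(ref_embedding, converted_embedding)


class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient
    in the backward pass, so the encoder is pushed to discard whatever
    the speaker classifier finds useful."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor for the reversed gradient

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


# Usage: the forward pass leaves encoder output unchanged, while the
# gradient coming back from the speaker classifier is flipped.
grl = GradientReversal(lam=0.5)
encoder_output = [0.2, -1.0, 3.0]
passed_through = grl.forward(encoder_output)
reversed_grad = grl.backward([1.0, -2.0, 4.0])

# Orthogonal embeddings give a loss of 1 under the cosine assumption.
loss = embedding_consistency_loss([1.0, 0.0], [0.0, 1.0])
```

In a real system the same idea would typically be written as a custom autograd function in the training framework, with the consistency loss added to the reconstruction loss under a tunable weight.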
Related papers
- Building Multi Lingual TTS Using Cross Lingual Voice Conversion (2020) 0.00
- Using Joint Training Speaker Encoder With Consistency Loss To Achieve Cross-lingual Voice Conversion And Expressive Voice Conversion (2023) 0.00
- AutoCycle-VC: Towards Bottleneck-independent Zero-shot Cross-lingual Voice Conversion (2023) 0.00
- Transfer Learning From Monolingual ASR To Transcription-free Cross-lingual Voice Conversion (2020) 0.00
- Enhancing Polyglot Voices By Leveraging Cross-lingual Fine-tuning In Any-to-one Voice Conversion (2024) 0.00
- Voice Conversion Challenge 2020: Intra-lingual Semi-parallel And Cross-lingual Voice Conversion (2020) 12.74
- Towards Natural And Controllable Cross-lingual Voice Conversion Based On Neural TTS Model And Phonetic Posteriorgram (2021) 0.00
- Learning In Your Voice: Non-parallel Voice Conversion Based On Speaker Consistency Loss (2020) 0.00