Transfer Learning From Monolingual ASR To Transcription-free Cross-lingual Voice Conversion
2020 · Che-Jui Chang
Abstract
Cross-lingual voice conversion (VC) aims to synthesize speech in the target speaker's voice while preserving the source content, when the source and target speakers speak different languages. The challenge is that the source and target data are inherently non-parallel, and bridging the gap between languages is even harder when no transcriptions are provided. In this paper, we focus on knowledge transfer from monolingual ASR to cross-lingual VC in order to address the content mismatch problem. To achieve this, we first train a monolingual acoustic model for the source language, use it to extract phonetic features for all the speech in the VC dataset, and then train a Seq2Seq conversion model to predict the mel-spectrograms. We successfully address cross-lingual VC without any transcription or language-specific knowledge for foreign speech. We evaluate this approach on the Voice Conversion Challenge 2020 datasets and show that our speaker-dependent conversion model outperforms the zero-shot baseline,
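The three steps named in the abstract (acoustic model, phonetic feature extraction, conversion model) can be pictured as a data flow. The sketch below is only an illustration under strong assumptions, not the paper's implementation: the acoustic model is replaced by a random linear layer with a softmax, the Seq2Seq conversion model is replaced by frame-wise ridge regression, and the spectrograms are random arrays. The names `extract_phonetic_features` and `fit_conversion_model` and the sizes `N_MELS` and `N_PHONES` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS = 80    # mel-spectrogram bins (assumed size)
N_PHONES = 40  # phone classes of the source-language acoustic model (assumed size)


def extract_phonetic_features(mel, asr_weights):
    """Stand-in for a frozen monolingual acoustic model.

    Maps each mel frame to a posterior over source-language phone classes.
    """
    logits = mel @ asr_weights                    # (frames, N_PHONES)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def fit_conversion_model(features, target_mel, ridge=1e-3):
    """Stand-in for the conversion model: ridge regression, features -> mel."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ target_mel)


# Step 1: an acoustic model trained on the source language only
# (random weights here, purely as a placeholder).
asr_weights = rng.normal(size=(N_MELS, N_PHONES))

# Step 2: extract phonetic features for the target speaker's speech,
# whatever language it is in; no transcription is used.
target_mel = rng.normal(size=(200, N_MELS))
target_feats = extract_phonetic_features(target_mel, asr_weights)

# Step 3: train a speaker-dependent model that predicts the target
# speaker's mel-spectrogram from the phonetic features.
conversion = fit_conversion_model(target_feats, target_mel)

# Conversion: source speech -> phonetic features -> target-voice mel frames.
source_mel = rng.normal(size=(120, N_MELS))
converted = extract_phonetic_features(source_mel, asr_weights) @ conversion
print(converted.shape)
```

The point of the sketch is that the same frozen feature extractor is applied to both speakers, so only the final mapping has to be learned per target speaker. A real system would also need a sequence model and a vocoder, which are omitted here.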