Cotatron: Transcription-guided Speech Encoder For Any-to-many Voice Conversion Without Parallel Data
2020 · Seung-Won Park, Doo-Young Kim, Myun-Chul Joe
Abstract
We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech from Cotatron features, an approach similar to previous methods based on Phonetic Posteriorgrams (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers unseen during training, and can use ASR to automate transcription with minimal loss of performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.
Related papers
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021)
- Taco-VC: A Single Speaker Tacotron Based Voice Conversion With Limited Data (2019)
- Building Multi Lingual TTS Using Cross Lingual Voice Conversion (2020)
- AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion (2021)
- CopyCat2: A Single Model For Multi-speaker TTS And Many-to-many Fine-grained Prosody Transfer (2022)
- CopyCat: Many-to-many Fine-grained Prosody Transfer For Neural Text-to-speech (2020)
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021)
- Bootstrapping Non-parallel Voice Conversion From Speaker-adaptive Text-to-speech (2019)