The THU-HCSI Multi-speaker Multi-lingual Few-shot Voice Cloning System For LIMMITS'24 Challenge
2024 · Yixuan Zhou, Shuoyi Zhou, Shun Lei, et al.
Abstract
This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix them with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains a considerable naturalness MOS of 3.97.
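The abstract does not spell out how the speaker-balanced sampling works. As an illustration only, the sketch below shows one common way to balance speakers when few-shot data is mixed with a much larger pre-training set: each utterance is weighted by the inverse of its speaker's utterance count, so every speaker has the same total probability of being drawn. The function names, the weighting rule, and the toy speaker labels are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import Counter


def speaker_balanced_weights(speaker_ids):
    """Return one sampling weight per utterance (hypothetical helper).

    Assumption: each speaker should receive equal total probability,
    so an utterance's weight is 1 / (num_speakers * utterances_of_speaker).
    """
    counts = Counter(speaker_ids)
    n_speakers = len(counts)
    return [1.0 / (n_speakers * counts[s]) for s in speaker_ids]


def sample_batch(utterances, speaker_ids, batch_size, rng=None):
    """Draw a batch with replacement using the speaker-balanced weights."""
    rng = rng or random.Random()
    weights = speaker_balanced_weights(speaker_ids)
    return rng.choices(utterances, weights=weights, k=batch_size)


# Toy mixed dataset: 8 pre-training utterances and 2 few-shot utterances.
speaker_ids = ["pretrain_a"] * 8 + ["target_x"] * 2
utterances = [f"utt_{i}" for i in range(len(speaker_ids))]

weights = speaker_balanced_weights(speaker_ids)
batch = sample_batch(utterances, speaker_ids, batch_size=4, rng=random.Random(0))
```

With these weights, the two few-shot utterances of `target_x` together carry the same sampling probability as the eight utterances of `pretrain_a`, so the target speaker is not drowned out during fine-tuning. In a PyTorch training loop the same weights could be passed to `torch.utils.data.WeightedRandomSampler`.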
Related papers
- Cross-lingual Multi-speaker Text-to-speech Synthesis For Voice Cloning Without Using Parallel Corpus For Unseen Speakers (2019)
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019)
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data (2021)
- Investigating On Incorporating Pretrained And Learnable Speaker Representations For Multi-speaker Multi-style Text-to-speech (2021)
- Cross-lingual Text-to-speech Using Multi-task Learning And Speaker Classifier Joint Training (2022)
- Voice Cloning: A Multi-speaker Text-to-speech Synthesis Approach Based On Transfer Learning (2021)
- Latent Linguistic Embedding For Cross-lingual Text-to-speech And Voice Conversion (2020)
- Scaling Nvidia's Multi-speaker Multi-lingual TTS Systems With Zero-shot TTS To Indic Languages (2024)