MuLanTTS: The Microsoft Speech Synthesis System For Blizzard Challenge 2023
2023 · Zhihang Xu, Shaofei Zhang, Xi Wang, et al.
Abstract
In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023. About 50 hours of French audiobook data for the hub task and another 2 hours of speaker adaptation data for the spoke task are released to build synthesized voices for different test purposes, including sentences, paragraphs, homographs, lists, etc. Building upon DelightfulTTS, we adopt contextual and emotion encoders to model the audiobook data beyond the sentence level, enriching long-form prosody and dialogue expressiveness. Regarding the recording quality, we also apply denoising algorithms and long-audio processing to both corpora. For the hub task, only the 50-hour single-speaker data is used to build the TTS system, while for the spoke task, a multi-speaker source model is used for target-speaker fine-tuning. MuLanTTS achieves mean quality assessment scores of 4.3 and 4.5 in the respective tasks, statistically comparable with natural speech while keepi
Related papers
- DelightfulTTS: The Microsoft Speech Synthesis System For Blizzard Challenge 2021 (2021)
- The THU-HCSI Multi-speaker Multi-lingual Few-shot Voice Cloning System For LIMMITS'24 Challenge (2024)
- Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems With Zero-shot TTS To Indic Languages (2024)
- Towards Natural Bilingual And Code-switched Speech Synthesis Based On Mix Of Monolingual Recordings And Cross-lingual Voice Conversion (2020)
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)
- The X-LANCE Technical Report For Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge (2024)
- WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus For Large Speech Generation Model Benchmark (2024)
- Building A Mixed-lingual Neural TTS System With Only Monolingual Data (2019)