Multilingual Byte2Speech Models For Scalable Low-resource Speech Synthesis
2021 · Mutian He, Jingzhou Yang, Lei He, et al.
Abstract
To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates the capability to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40 s of transcribed recordings, without the need for per-language resources such as lexicons, extra corpora, auxiliary models, or linguistic expertise. This ensures scalability while retaining satisfactory intelligibility and naturalness that match rich-resource models. Exhaustive comparative and ablation studies are performed to reveal the potential of the framework for low-resource languages. Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.
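As a rough illustration of why byte inputs allow arbitrary scripts, the sketch below encodes text as UTF-8 and treats each byte as one input ID in the range 0 to 255. The function names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and chosen for this example only; the abstract does not describe the paper's actual preprocessing, which may add special tokens or other inputs.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte.

    Hypothetical helper for illustration; not the paper's actual front end.
    """
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Invert text_to_byte_ids by decoding the byte sequence as UTF-8."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    # The same 256-entry vocabulary covers every script; only the
    # sequence length per character changes.
    for sample in ["hello", "你好"]:
        ids = text_to_byte_ids(sample)
        print(sample, len(sample), "chars ->", len(ids), "bytes:", ids)
```

Under this assumption, no per-language grapheme or phoneme inventory is needed: Latin letters take one byte each, while characters in other scripts expand to several bytes that the model has to learn to group.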
Related papers
- Bytes Are All You Need: End-to-end Multilingual Speech Recognition And Synthesis With Bytes (2018)
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)
- Leveraging Translations For Speech Transcription In Low-resource Settings (2018)
- Non-linear Pairwise Language Mappings For Low-resource Multilingual Acoustic Model Fusion (2022)
- Multilingual End-to-end Speech Translation (2019)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features (2021)
- AlloST: Low-resource Speech Translation Without Source Transcription (2021)
- One-to-many Multilingual End-to-end Speech Translation (2019)