An Initial Investigation Of Language Adaptation For TTS Systems Under Low-resource Scenarios
2024 · Cheng Gong, Erica Cooper, Xin Wang, et al.
Abstract
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
Related papers
- How To Learn A New Language? An Efficient Solution For Self-supervised Learning Models Unseen Languages Adaption In Low-resource Scenario (2024)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features (2021)
- Maximizing Data Efficiency For Cross-lingual TTS Adaptation By Self-supervised Representation Mixing And Embedding Initialization (2024)
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023)
- Exploring Efficient-tuning Methods In Self-supervised Speech Models (2022)
- Automatic Data Augmentation For Domain Adapted Fine-tuning Of Self-supervised Speech Representations (2023)
- An Adapter Based Pre-training For Efficient And Scalable Self-supervised Speech Representation Learning (2021)