Adapting TTS Models For New Speakers Using Transfer Learning
2021 · Paarth Neekhara, Jason Li, Boris Ginsburg
Abstract
Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high-quality speech data. Prior works on voice cloning attempt to address this challenge by adapting pre-trained multi-speaker TTS models to a new voice, using a few minutes of speech data from the new speaker. However, publicly available large multi-speaker datasets are often noisy, resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high-quality single-speaker TTS models to a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield comparable performance to a model trained from scratch on more than 27 hours of dat
Related papers
- Voice Cloning: A Multi-speaker Text-to-speech Synthesis Approach Based On Transfer Learning (2021)
- Using IPA-based Tacotron For Data Efficient Cross-lingual Speaker Adaptation And Pronunciation Enhancement (2020)
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023)
- Data Efficient Voice Cloning From Noisy Samples With Domain Adversarial Training (2020)
- Sample Efficient Adaptive Text-to-speech (2018)
- Comparative Analysis Of Transfer Learning In Deep Learning Text-to-speech Models On A Few-shot, Low-resource, Customized Dataset (2023)
- Adapter-based Extension Of Multi-speaker Text-to-speech Model For New Speakers (2022)
- Data Efficient Voice Cloning For Neural Singing Synthesis (2019)