Towards Transfer Learning For End-to-end Speech Synthesis From Deep Pre-trained Language Models
2019 · Wei Fang, Yu-An Chung, James Glass
Abstract
Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar for developing high-quality TTS systems remains high, since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to the commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and size. Audio generated by TTS systems trained on publicly available data tends not only to sound less natural but also to exhibit more background noise. In this work, we aim to lower TTS systems' reliance on high-quality data by providing them with the textual knowledge extracted by deep pre-trained language models during training. In particular, we investigate the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS system consisting of an encoder and an attention-based decoder. BERT representations learned from large amounts of unlabeled text data are shown to contain very rich semantic
Related papers
- End-to-end Text-to-speech For Low-resource Languages By Cross-lingual Transfer Learning (2019)
- Training Text-to-speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks (2022)
- Semi-supervised Training For Improving Data Efficiency In End-to-end Speech Synthesis (2018)
- Knowledge Transfer From Large-scale Pretrained Language Models To End-to-end Speech Recognizers (2022)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Comparative Analysis Of Transfer Learning In Deep Learning Text-to-speech Models On A Few-shot, Low-resource, Customized Dataset (2023)
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023)
- Transfer Learning Framework For Low-resource Text-to-speech Using A Large-scale Unlabeled Speech Corpus (2022)