Phoneme-level BERT For Enhanced Prosody Of Text-to-speech With Grapheme Predictions
2023 · Yinghao Aaron Li, Cong Han, Xilin Jiang, et al.
Abstract
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
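The pretext task described in the abstract can be illustrated with a minimal sketch of how one training example might be built: the encoder sees a partly masked phoneme sequence and is asked to recover the masked phonemes as well as the grapheme (here, the word) each phoneme belongs to. The abstract does not give these details, so everything below is an assumption made for illustration: the toy `lexicon`, the function name `build_example`, masking whole words at once, the 15% default masking rate, and supervising graphemes at every position.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol for phoneme inputs


def build_example(words, lexicon, mask_prob=0.15, rng=None):
    """Build (inputs, phoneme_targets, grapheme_targets) for one sentence.

    words:   list of word strings (the graphemes)
    lexicon: dict mapping each word to its list of phonemes; a toy
             stand-in for a real grapheme-to-phoneme converter
    """
    rng = rng or random.Random()
    inputs, phoneme_targets, grapheme_targets = [], [], []
    for word in words:
        # Assumption: all phonemes of a word are masked together.
        masked = rng.random() < mask_prob
        for phoneme in lexicon[word]:
            inputs.append(MASK if masked else phoneme)
            # Masked-phoneme objective: a target exists only at masked
            # positions; None marks positions the loss would skip.
            phoneme_targets.append(phoneme if masked else None)
            # Grapheme objective: every phoneme is labelled with the
            # word it came from (assumed to apply at all positions).
            grapheme_targets.append(word)
    return inputs, phoneme_targets, grapheme_targets


if __name__ == "__main__":
    toy_lexicon = {"the": ["DH", "AH"], "cat": ["K", "AE", "T"]}
    example = build_example(["the", "cat"], toy_lexicon, rng=random.Random(0))
    for name, row in zip(("inputs", "phonemes", "graphemes"), example):
        print(name, row)
```

In a full model, two classification heads on top of the phoneme encoder would be trained on these two target sequences, typically with a summed cross-entropy loss. The downstream TTS system would then feed phonemes only, which is the efficiency argument the abstract makes.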
Related papers
- Mixed-Phoneme BERT: Improving BERT With Mixed Phoneme And Sup-phoneme Representations For Text To Speech (2022) 9.41
- Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features (2022) 6.34
- Improving Prosody Modelling With Cross-utterance BERT Embeddings For End-to-end Speech Synthesis (2020) 10.61
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023) 0.00
- Polyphone Disambiguation And Accent Prediction Using Pre-trained Language Models In Japanese TTS Front-end (2022) 5.24
- Cross-dialect Text-to-speech In Pitch-accent Language Incorporating Multi-dialect Phoneme-level BERT (2024) 3.58
- Prosodic Representation Learning And Contextual Sampling For Neural Text-to-speech (2020) 6.77
- HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks For Expressive Long-form TTS (2023) 5.84