Investigation Of Japanese PnG BERT Language Model In Text-to-speech Synthesis For Pitch Accent Language
2022 · Yusuke Yasuda, Tomoki Toda
Abstract
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch accents in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS. We investigate the effects of features captured by PnG BERT on Japanese TTS by modifying the fine-tuning condition to determine the conditions helpful for inferring pitch accents. We manipulate the content of PnG BERT features from being text-oriented to speech-oriented by changing the number of fine-tuned layers during TTS. In addition, we teach PnG BERT pitch accent information by fine-tuning with tone prediction as an additional downstream task. Our experimental results show that the features of PnG BERT captured by pretraining contain information helpful for inferring pitch accents, and PnG BERT outperforms baseline Ta
Related papers
- Polyphone Disambiguation And Accent Prediction Using Pre-trained Language Models In Japanese TTS Front-end (2022)
- Investigation Of Enhanced Tacotron Text-to-speech Synthesis Systems With Self-attention For Pitch Accent Language (2018)
- Cross-dialect Text-to-speech In Pitch-accent Language Incorporating Multi-dialect Phoneme-level BERT (2024)
- Investigation Of Learning Abilities On Linguistic Features In Sequence-to-sequence Text-to-speech Synthesis (2020)
- Investigating Accuracy Of Pitch-accent Annotations In Neural Network-based Speech Synthesis And Denoising Effects (2018)
- Phoneme-level BERT For Enhanced Prosody Of Text-to-speech With Grapheme Predictions (2023)
- Disambiguation Of Chinese Polyphones In An End-to-end Framework With Semantic Features Extracted By Pre-trained BERT (2025)
- BERT, Can HE Predict Contrastive Focus? Predicting And Controlling Prominence In Neural TTS Using A Language Model (2022)