StyleSpeech: Self-supervised Style Enhancing With VQ-VAE-based Pre-training For Expressive Audiobook Speech Synthesis
2023 · Xueyuan Chen, Xi Wang, Shaofei Zhang, et al.
Abstract
The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner, with plenty of audio data that covers complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model the pronunciation and high-level style expressiveness respectively, with the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis especially for the role and ou
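The quantization step that gives a VQ-VAE its name can be sketched as follows. This is a generic illustration of nearest-codebook lookup, not the authors' style extractor: the abstract does not state the codebook size, the latent dimension, or the distance measure, so the function names (`squared_distance`, `quantize`), the squared Euclidean distance, and the toy values below are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous latent frame to its nearest codebook entry.

    Returns the chosen codebook indices and the corresponding
    quantized vectors (copies of the selected codebook entries).
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical toy values: a 3-entry codebook of 2-dimensional codes
# and three latent frames, one close to each entry.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(frames, codebook)
print(indices)  # [0, 1, 2]
```

In a trained model the codebook is learned jointly with the encoder and decoder. The sketch covers only the lookup that turns continuous latents into discrete codes.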
Related papers
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- Style-label-free: Cross-speaker Style Transfer By Quantized VAE And Speaker-wise Normalization In Speech Synthesis (2022)
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018)
- Interpretable Style Transfer For Text-to-speech With ControlVAE And Diffusion Bridge (2023)
- Using VAEs And Normalizing Flows For One-shot Text-to-speech Synthesis Of Expressive Speech (2019)
- Exploring Synthetic Data For Cross-speaker Style Transfer In Style Representation Based TTS (2024)
- Context-aware Coherent Speaking Style Prediction With Hierarchical Transformers For Audiobook Speech Synthesis (2023)
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)