Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts
2022 · Detai Xin, Sharath Adavanne, Federico Ang, et al.
Abstract
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, namely preceding acoustic context and bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights into the different choices of context for audiobook TTS (modality, lateral information, and length) that have not previously been discussed in the literature.
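To make the distinction between "preceding" and "bilateral" context concrete, here is a minimal sketch of how context indices might be selected for one sentence. It is an illustration only, not the authors' implementation: the names `ContextWindow` and `select_context`, and the use of a single `length` for both modalities, are assumptions. The encoders themselves are not modeled.

```python
from typing import List, NamedTuple


class ContextWindow(NamedTuple):
    acoustic: List[int]  # indices of preceding utterances (audio context)
    textual: List[int]   # indices of neighbouring sentences on both sides


def select_context(index: int, num_sentences: int, length: int) -> ContextWindow:
    """Pick context indices for the sentence at `index` (hypothetical helper).

    Acoustic context is unilateral: only utterances before `index`.
    Textual context is bilateral: sentences before and after `index`.
    Windows are clipped at the start and end of the book.
    """
    if not 0 <= index < num_sentences:
        raise IndexError("index out of range")
    start = max(0, index - length)
    end = min(num_sentences - 1, index + length)
    acoustic = list(range(start, index))
    textual = [i for i in range(start, end + 1) if i != index]
    return ContextWindow(acoustic, textual)


window = select_context(index=5, num_sentences=8, length=2)
print(window.acoustic)  # [3, 4]
print(window.textual)   # [3, 4, 6, 7]
```

In this sketch the acoustic window looks only backward while the textual window looks both ways, which mirrors the two context types named in the abstract; how each window is encoded and fed to the TTS model is left out.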