Prosody Analysis Of Audiobooks
2023 · Charuta Pethe, Bach Pham, Felix D Childress, et al.
Abstract
Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for predicting prosody properties (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than results from a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of the 24 books, while our predicted volume attribute proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent to which people prefer prosody-enhanced audiobook readings over commercial text-to-speech systems.
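The per-book comparison described in the abstract can be illustrated with a minimal sketch. It assumes Pearson correlation over per-sentence prosody values; the abstract does not say which correlation measure or granularity the authors use. The names `pearson`, `count_wins`, and `toy_books` are hypothetical, and the numbers are made-up toy values, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def count_wins(books):
    """Count books where the model tracks the human reading better than TTS.

    `books` maps a book id to a (human, model, tts) triple of sequences
    holding one prosody attribute (e.g. pitch) per sentence.
    """
    return sum(
        1
        for human, model, tts in books.values()
        if pearson(model, human) > pearson(tts, human)
    )


# Toy values for illustration only (assumption: one pitch value per sentence).
toy_books = {
    "book_a": ([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [2.0, 2.0, 2.1, 2.0]),
    "book_b": ([3.0, 1.0, 2.0], [1.0, 2.0, 3.0], [3.0, 1.0, 2.5]),
}

if __name__ == "__main__":
    wins = count_wins(toy_books)
    print(f"model closer to human reading on {wins} of {len(toy_books)} toy books")
```

A count of this kind is what a statement such as "22 out of the 24 books" summarizes: one win per book where the model's correlation with the human reading exceeds that of the baseline system.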
Related papers
- Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts (2022) 7.81
- Computational Narrative Understanding For Expressive Text-to-speech (2025) 0.00
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020) 10.07
- Location, Location: Enhancing The Evaluation Of Text-to-speech Synthesis Using The Rapid Prosody Transcription Paradigm (2021) 3.58
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020) 11.76
- Context-aware Coherent Speaking Style Prediction With Hierarchical Transformers For Audiobook Speech Synthesis (2023) 5.24
- Audio-conditioned Phonemic And Prosodic Annotation For Building Text-to-speech Models From Unlabeled Speech Data (2024) 3.58
- Dynamic Prosody Generation For Speech Synthesis Using Linguistics-driven Acoustic Embedding Selection (2019) 7.81