Computational Narrative Understanding For Expressive Text-to-speech
2025 · Gaspard Michel, Elena V. Epure, Christophe Cerisara
Abstract
Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building on this observation, we introduce LibriQuote, a large-scale dataset of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., "he whispered softly"). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances the expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expre
Related papers
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)
- StoryTTS: A Highly Expressive Text-to-speech Dataset With Rich Textual Expressiveness Annotations (2024)
- Prosody Analysis Of Audiobooks (2023)
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023)
- ComedicSpeech: Text To Speech For Stand-up Comedies In Low-resource Scenarios (2023)
- LibriTTS-VI: A Public Corpus And Novel Methods For Efficient Voice Impression Control (2025)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech: A Deep Learning Approach (2019)