ContextSpeech: Expressive And Efficient Text-to-speech For Paragraph Reading
2023 · Yujia Xiao, Shaofei Zhang, Xi Wang, et al.
Abstract
While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.
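The abstract names linearized self-attention as the efficiency mechanism but does not say which linearization is used. As a rough illustration only, the sketch below assumes a common kernel feature-map formulation with phi(x) = elu(x) + 1; the function names `phi`, `linear_attention`, and `quadratic_attention` are hypothetical and not taken from the paper. It shows why the cost grows linearly with sequence length: the key/value summary is accumulated once and then reused for every query.

```python
import math

def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature strictly positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal kernel attention in O(n * d * d_v) time (illustrative sketch)."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k_j) v_j^T
    k_sum = [0.0] * d                     # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out

def quadratic_attention(Q, K, V):
    """Same kernel attention computed pairwise in O(n^2 * d), used as a check."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point error; only the order of summation differs, which is what removes the quadratic dependence on sequence length.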
Related papers
- ParaTTS: Learning Linguistic And Prosodic Cross-sentence Information In Paragraph-based TTS (2022) · 8.82
- Text Enhancement For Paragraph Processing In End-to-end Code-switching TTS (2022) · 0.00
- Simple And Effective Multi-sentence TTS With Expressive And Coherent Prosody (2022) · 7.16
- MaskedSpeech: Context-aware Speech Synthesis With Masking Strategy (2022) · 4.52
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024) · 2.26
- Exploiting Deep Sentential Context For Expressive End-to-end Speech Synthesis (2020) · 5.84
- Speak, Read And Prompt: High-fidelity Text-to-speech With Minimal Supervision (2023) · 0.00
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024) · 6.34