Leveraging The Interplay Between Syntactic And Acoustic Cues For Optimizing Korean TTS Pause Formation
2024 · Yejin Jeon, Yunsu Kim, Gary Geunbae Lee
Abstract
Contemporary neural speech synthesis models have demonstrated remarkable proficiency, attaining quality comparable to that of human-produced speech. However, these achievements have predominantly been verified for high-resource languages such as English. Furthermore, Tacotron and FastSpeech variants show substantial pausing errors when applied to Korean, which affects speech perception and naturalness. To address these issues, we propose a novel framework that comprehensively models both the syntactic and acoustic cues associated with pausing patterns. Remarkably, our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences, despite being trained on short audio clips. Architectural design choices are validated through com
Related papers
- Pausespeech: Natural Speech Synthesis Via Pre-trained Language Model And Pause-based Prosody Modeling (2023)
- Duration-aware Pause Insertion Using Pre-trained Language Model For Multi-speaker Text-to-speech (2023)
- Empirical Study Incorporating Linguistic Knowledge On Filled Pauses For Personalized Spontaneous Speech Synthesis (2022)
- Phonology-guided Speech-to-speech Translation For African Languages (2024)
- Modeling Prosodic Phrasing With Multi-task Learning In Tacotron-based TTS (2020)
- Applying Syntax–prosody Mapping Hypothesis And Prosodic Well-formedness Constraints To Neural Sequence-to-sequence Speech Synthesis (2022)
- Improving Robustness Of Spontaneous Speech Synthesis With Linguistic Speech Regularization And Pseudo-filled-pause Insertion (2022)
- Empowering Global Voices: A Data-efficient, Phoneme-tone Adaptive Approach To High-fidelity Speech Synthesis (2025)