ProsodyFM: Unsupervised Phrasing And Intonation Control For Intelligible Speech Synthesis
2024 · Xiangheng He, Junjie Chen, Zixing Zhang, et al.
Abstract
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns.
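The abstract does not say how the bank of intonation shape tokens is queried. One common way to use such a bank is to let a pitch-derived query attend over the tokens and take the weighted sum as the intonation embedding. The sketch below illustrates that idea only; the attention form, the sizes, and the names `attend_to_token_bank`, `token_bank`, and `query` are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_token_bank(query, token_bank):
    """Scaled dot-product attention of one query over a bank of tokens.

    Returns the attention weights and the weighted sum of the tokens.
    This is an assumed mechanism, not one confirmed by the abstract.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
        for token in token_bank
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * token[i] for w, token in zip(weights, token_bank))
        for i in range(dim)
    ]
    return weights, mixed


rng = random.Random(0)
NUM_TOKENS, DIM = 4, 8  # hypothetical sizes, chosen only for the demo

# In a trained model the bank would be learned; here it is random.
token_bank = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

# Stand-in for an embedding derived from the final pitch contour of a phrase.
query = [rng.gauss(0, 1) for _ in range(DIM)]

weights, intonation_embedding = attend_to_token_bank(query, token_bank)
```

Because the weights form a distribution over a small, fixed set of tokens, each token can in principle be inspected or selected on its own, which is what would make a bank of this kind usable for intonation control.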
Related papers
- PauseSpeech: Natural Speech Synthesis Via Pre-trained Language Model And Pause-based Prosody Modeling (2023)
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022)
- Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0 (2020)
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021)
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022)
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020)
- Modeling Prosodic Phrasing With Multi-task Learning In Tacotron-based TTS (2020)
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020)