Vocal Effort Modeling In Neural TTS For Improving The Intelligibility Of Synthetic Speech In Noise
2022 · Tuomo Raitio, Petko Petkov, Jiangchuan Li, et al.
Abstract
We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model with normalized spectral tilt among other prosodic factors. Changing the spectral tilt parameter and keeping other prosodic factors unchanged enables effective vocal effort control at synthesis time independent of other prosodic factors. By extrapolation of the spectral tilt values beyond what has been seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort in the presence of various masking noise conditions, and compare these to well-known speech intelligibility-enhancing algorithms. T
Authors
(none)
Tags
Stats
Related papers
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020)11.76
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)8.82
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)8.60
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)8.35
- Environment Aware Text-to-speech Synthesis (2021)6.34
- A Study On The Efficacy Of Model Pre-training In Developing Neural Text-to-speech System (2021)2.26
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)0.00
- Prosody-controllable Spontaneous TTS With Neural Hmms (2022)8.09