Enhancing Speech Intelligibility In Text-to-speech Synthesis Using Speaking Style Conversion
2020 · Dipjyoti Paul, Muhammed Pv Shifas, Yannis Pantazis, et al.
Abstract
The increased adoption of digital assistants makes text-to-speech (TTS) synthesis systems an indispensable feature of modern mobile devices. It is hence desirable to build a system capable of generating highly intelligible speech in the presence of noise. Past studies have investigated style conversion in TTS synthesis, yet degraded synthesized quality often leads to worse intelligibility. To overcome such limitations, we propose a novel transfer learning approach using Tacotron- and WaveRNN-based TTS synthesis. The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC), which has been shown to provide high intelligibility gains by redistributing the signal energy in the time-frequency domain. We refer to this extension as the Lombard-SSDRC TTS system. Intelligibility enhancement as quantified by the Speech Intelligibility in Bits (SIIB-Gauss) measure shows that the proposed Lombard-SSDRC TTS syste
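The idea of redistributing signal energy can be illustrated with a small sketch. This is not the SSDRC algorithm used in the paper: the pre-emphasis filter is only a crude stand-in for spectral shaping, the peak-follower compressor is a generic design, and every function name and parameter value below is a hypothetical choice made for illustration. Rescaling the output to the input RMS is likewise an assumption, reflecting the reading that energy is moved around rather than added.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def pre_emphasis(x, coeff=0.95):
    """First-order high-frequency tilt (assumed stand-in for spectral shaping)."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def compress(x, exponent=0.5, smooth=0.99):
    """Generic compressor: scale each sample by envelope ** (exponent - 1)."""
    out, env = [], 0.0
    for s in x:
        env = max(abs(s), smooth * env)  # peak follower with exponential decay
        gain = env ** (exponent - 1.0) if env > 1e-8 else 1.0
        out.append(s * gain)
    return out

def shape_and_compress(x, coeff=0.95, exponent=0.5):
    """Apply both steps, then rescale so the output RMS equals the input RMS."""
    y = compress(pre_emphasis(x, coeff), exponent)
    scale = rms(x) / rms(y)
    return [s * scale for s in y]

# Toy input: a loud tone followed by a quiet one (16 kHz, 440 Hz, assumed values).
sr = 16000
loud = [0.8 * math.sin(2 * math.pi * 440 * n / sr) for n in range(800)]
quiet = [0.05 * math.sin(2 * math.pi * 440 * n / sr) for n in range(800)]
signal = loud + quiet
out = shape_and_compress(signal)

ratio_before = rms(quiet) / rms(loud)
ratio_after = rms(out[800:]) / rms(out[:800])
print(ratio_before, ratio_after)
```

In this toy case the overall level is unchanged while the quiet segment gains level relative to the loud one, which is the kind of trade that can help low-energy speech segments survive noise. A real system would operate on speech with frame-wise, frequency-dependent processing.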
Related papers
- Speaking Style Adaptation In Text-to-speech Synthesis Using Sequence-to-sequence Models With Attention (2018)
- Stylespeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)
- Transplantation Of Conversational Speaking Style With Interjections In Sequence-to-sequence Speech Synthesis (2022)
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021)
- Styletts: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- CALM: Contrastive Cross-modal Speaking Style Modeling For Expressive Text-to-speech Synthesis (2023)
- Style-talker: Finetuning Audio Language Model And Style-based Text-to-speech Model For Fast Spoken Dialogue Generation (2024)
- End-to-end Text-to-speech Based On Latent Representation Of Speaking Styles Using Spontaneous Dialogue (2022)