In Other News: A Bi-style Text-to-speech Model For Synthesizing Newscaster Voice With Limited Data
2019 · Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, et al.
Abstract
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech, but they require a large quantity of training data. This makes creating models for multiple styles expensive and time-consuming. In this paper, different styles of speech are analysed based on prosodic variations, and from this analysis a model is proposed to synthesise speech in the style of a newscaster with just a few hours of supplementary data. We pose the problem of synthesising in a target style using limited data as that of creating a bi-style model that can synthesise both neutral-style and newscaster-style speech via a one-hot vector which factorises the two styles. We also propose conditioning the model on contextual word embeddings, and extensively evaluate it against neutral NTTS and neutral concatenative-based synthesis. This model closes the gap in perceived style-appropriateness between natural recordings for newscaster-style of speech, and neutral speech synthesis b
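The one-hot style vector mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say where the vector enters the network, so the scheme below (appending it to every encoder time step) is an assumption, and the names `style_one_hot`, `condition_on_style`, and the toy encoder states are hypothetical.

```python
from typing import List

# The two styles named in the abstract; the ordering is an arbitrary choice here.
STYLES = ("neutral", "newscaster")


def style_one_hot(style: str) -> List[float]:
    """Return a one-hot vector selecting one of the two styles."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return [1.0 if s == style else 0.0 for s in STYLES]


def condition_on_style(encoder_states: List[List[float]], style: str) -> List[List[float]]:
    """Append the style one-hot to every time step (assumed conditioning scheme)."""
    code = style_one_hot(style)
    return [list(state) + code for state in encoder_states]


# Toy "encoder output": 3 time steps with 2 features each (made-up numbers).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
neutral = condition_on_style(states, "neutral")
newscaster = condition_on_style(states, "newscaster")
```

The same input states produce two differently conditioned sequences, so in this sketch a single model can be asked for either style at synthesis time by switching the vector.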
Related papers
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020) · 11.29
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020) · 8.60
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) · 7.81
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021) · 9.23
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024) · 7.81
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022) · 3.58
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) · 8.82