Speaking Style Adaptation In Text-to-speech Synthesis Using Sequence-to-sequence Models With Attention
2018 · Bajibabu Bollepalli, Lauri Juvela, Paavo Alku
Abstract
Currently, there is increasing interest in using sequence-to-sequence models with attention for text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech of good quality. However, for challenging speaking styles such as Lombard speech, it is difficult to record sufficiently large speech corpora. Therefore, in this study we propose a transfer learning method to adapt a sequence-to-sequence based TTS system from normal speaking style to Lombard style. Moreover, we experiment with a WaveNet vocoder in the synthesis of Lombard speech. We conducted subjective evaluations to assess the performance of the adapted TTS systems. The results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system.
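The adaptation idea in the abstract (pretrain on plentiful normal-style data, then continue training on a small Lombard corpus) can be illustrated with a toy sketch. This is not the authors' system: a 1-D linear model stands in for the sequence-to-sequence network, and the two "styles", the dataset sizes, the learning rates, and all function names are assumptions made up for illustration.

```python
# Toy stand-in for style adaptation by fine-tuning. NOT a TTS model:
# a 1-D linear map y = w*x + b plays the role of the network, and all
# data and hyperparameters below are made up for illustration.

def sgd(w, b, data, lr, epochs):
    """Run plain SGD on squared error and return the updated (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Hypothetical "styles": normal follows y = 2x + 1, Lombard y = 2.2x + 1.3.
def normal(x):
    return 2.0 * x + 1.0

def lombard(x):
    return 2.2 * x + 1.3

# Large normal-style corpus, tiny Lombard corpus, held-out Lombard test set.
normal_train = [(x, normal(x)) for x in (-1 + 2 * i / 199 for i in range(200))]
lombard_train = [(x, lombard(x)) for x in (-0.8, -0.3, 0.1, 0.5, 0.9)]
lombard_test = [(x, lombard(x)) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Step 1: pretrain on the large normal-style corpus.
w_pre, b_pre = sgd(0.0, 0.0, normal_train, lr=0.1, epochs=20)

# Step 2: adapt by continuing training on the small Lombard corpus.
w_adapt, b_adapt = sgd(w_pre, b_pre, lombard_train, lr=0.05, epochs=3)

# Baseline: same small budget on Lombard data, but from a zero initialization.
w_scratch, b_scratch = sgd(0.0, 0.0, lombard_train, lr=0.05, epochs=3)

mse_adapted = mse(w_adapt, b_adapt, lombard_test)
mse_scratch = mse(w_scratch, b_scratch, lombard_test)
print(f"adapted MSE: {mse_adapted:.4f}, from-scratch MSE: {mse_scratch:.4f}")
```

With only five Lombard examples and three passes over them, the model initialized from the pretrained weights ends up with a lower held-out error than the one trained from zero, which is the basic motivation for transfer learning when target-style data is scarce.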
Related papers
- Enhancing Speech Intelligibility In Text-to-speech Synthesis Using Speaking Style Conversion (2020)
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023)
- Generalized Multilingual Text-to-speech Generation With Language-aware Style Adaptation (2025)
- Speaker-adaptive Neural Vocoders For Parametric Speech Synthesis Systems (2018)
- StyleTTS 2: Towards Human-level Text-to-speech Through Style Diffusion And Adversarial Training With Large Speech Language Models (2023)
- Sample Efficient Adaptive Text-to-speech (2018)
- Transplantation Of Conversational Speaking Style With Interjections In Sequence-to-sequence Speech Synthesis (2022)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)