Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
2024 · Wooseok Han, Minki Kang, Changhun Kim, et al.
Abstract
Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning, which maintains its synthesis ability on prior samples and thereby prevents overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even when target speech samples are limited in quantity and noisy.
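The abstract does not give the form of the prior-preservation loss. As a rough illustration only, the sketch below assumes the common pattern of adding a weighted loss on prior samples to the loss on target samples. The names `fine_tuning_objective`, `loss_fn`, and `prior_weight` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def mean_loss(loss_fn: Callable[[float], float], batch: Sequence[float]) -> float:
    """Average a per-sample loss over a non-empty batch."""
    if not batch:
        raise ValueError("batch must not be empty")
    return sum(loss_fn(sample) for sample in batch) / len(batch)


def fine_tuning_objective(
    loss_fn: Callable[[float], float],
    target_batch: Sequence[float],
    prior_batch: Sequence[float],
    prior_weight: float = 1.0,
) -> float:
    """Assumed form: target-speaker loss plus a weighted loss on prior samples.

    The prior term penalizes the model for losing its ability to synthesize
    the high-quality prior samples while it adapts to the target speaker.
    The weighted sum and `prior_weight` are assumptions, not the paper's
    stated formulation.
    """
    target_term = mean_loss(loss_fn, target_batch)
    prior_term = mean_loss(loss_fn, prior_batch)
    return target_term + prior_weight * prior_term
```

In a real system `loss_fn` would be the TTS model's training loss evaluated on a speech sample. A toy scalar loss is used here only so the sketch runs without any dependencies.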
Related papers
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) 8.09
- Speak, Read And Prompt: High-fidelity Text-to-speech With Minimal Supervision (2023) 0.00
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) 10.97
- Generating Speakers By Prompting Listener Impressions For Pre-trained Multi-speaker Text-to-speech Systems (2024) 3.58
- Empowering Global Voices: A Data-efficient, Phoneme-tone Adaptive Approach To High-fidelity Speech Synthesis (2025) 0.00
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020) 11.76
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025) 0.00
- Dynamic Prosody Generation For Speech Synthesis Using Linguistics-driven Acoustic Embedding Selection (2019) 7.81