PerTTS: Personalized and Controllable Zero-Shot Spontaneous Style Text-to-Speech Synthesis

Abstract

Personalized and controllable zero-shot spontaneous-style speech synthesis is important in spoken scenarios, particularly for generating natural and expressive speech for unseen speakers under data-limited conditions. Traditional methods typically achieve this by fine-tuning pre-trained multi-speaker speech synthesis models or by adopting zero-shot adaptation techniques. However, these methods are limited in voice cloning and style modeling: they struggle to capture the fine-grained voice characteristics and complex speaking styles of target speakers. In this paper, we propose PerTTS, a personalized and controllable zero-shot spontaneous speech synthesis method. PerTTS introduces a personalized speaking style encoder that uses pre-trained models and a local prosody encoder to extract semantic, duration, timbre, and prosody information from multiple reference utterances of the target speaker, forming a comprehensive personalized representation of speaking style. Furthermore, we employ knowledge distillation to learn spontaneous behavior patterns and incorporate a multi-modal pseudo-label detector to extract labels from unlabeled data, enabling the modeling and control of spontaneous behaviors. Experimental results demonstrate that PerTTS significantly outperforms existing models in speaking style similarity and speech naturalness. The personalized speaking style representation improves style similarity, while spontaneous behavior modeling further improves the naturalness and spontaneity of the synthesized speech and enables controllable generation of spontaneous behaviors.
