ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
2023 · Minki Kang, Wooseok Han, Sung Ju Hwang, et al.
Abstract
Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for speakers seen during training, without considering generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.
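The abstract does not say what form the guidance on the diffusion model takes. As a generic illustration only, the sketch below assumes a classifier-free-style rule in which an unconditional noise prediction is pushed toward an emotion-conditioned one by a guidance scale. The function name `guided_noise` and its parameters are hypothetical and are not taken from the paper.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Blend two noise predictions with a guidance scale.

    ASSUMPTION: a classifier-free-style rule, used here purely as an
    illustration; the abstract does not specify ZET-Speech's guidance method.

    eps_uncond: noise predicted without the emotion label (list of floats)
    eps_cond:   noise predicted with the target emotion label (same length)
    scale:      0 ignores the emotion, 1 uses the conditional prediction,
                values above 1 extrapolate past it
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction along the direction
    # (conditional - unconditional), scaled by the guidance weight.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    neutral = [0.0, 1.0]   # toy unconditional prediction
    emotion = [1.0, 3.0]   # toy emotion-conditioned prediction
    print(guided_noise(neutral, emotion, 2.0))
```

Under this assumed rule, a larger scale trades some naturalness for a more pronounced target emotion, which is the usual motivation for exposing it as a user-facing control.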
Related papers
- StyleTTS-ZS: Efficient High-quality Zero-shot Text-to-speech Synthesis With Distilled Time-varying Style Diffusion (2024) 3.58
- Humane Speech Synthesis Through Zero-shot Emotion And Disfluency Generation (2024) 0.00
- EmoMix: Emotion Mixing Via Diffusion Models For Emotional Speech Synthesis (2023) 0.00
- ZSDEVC: Zero-shot Diffusion-based Emotional Voice Conversion With Disentangled Mechanism (2024) 0.00
- Word-level Emotional Expression Control In Zero-shot Text-to-speech Synthesis (2025) 0.00
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024) 10.35
- Textless And Non-parallel Speech-to-speech Emotion Style Transfer (2025) 0.00
- Laugh Now Cry Later: Controlling Time-varying Emotional States Of Flow-matching-based Zero-shot Text-to-speech (2024) 6.77