Word-level Emotional Expression Control In Zero-shot Text-to-speech Synthesis
2025 · Tianrui Wang, Haoyu Wang, Meng Ge, et al.
Abstract
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine
Related papers
- Fine-grained Emotional Control Of Text-to-speech: Learning To Rank Inter- And Intra-class Emotion Intensities (2023) · 6.77
- Zet-speech: Zero-shot Adaptive Emotion-controllable Text-to-speech Synthesis With Diffusion And Style-based Models (2023) · 8.09
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025) · 0.00
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019) · 5.84
- Emotional Speech Synthesis With Rich And Granularized Control (2019) · 13.39
- Laugh Now Cry Later: Controlling Time-varying Emotional States Of Flow-matching-based Zero-shot Text-to-speech (2024) · 6.77
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024) · 0.00
- Semi-supervised Learning For Continuous Emotional Intensity Controllable Speech Synthesis With Disentangled Representations (2022) · 0.00