Supervised And Unsupervised Approaches For Controlling Narrow Lexical Focus In Sequence-to-sequence Speech Synthesis
2021 · Slava Shechtman, Raul Fernandez, David Haws
Abstract
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work, we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests, which demonstrate that we can realize controllable focus while maintaining naturalness equal to or higher than that of an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.
Related papers
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022)
- Semi-supervised Learning For Continuous Emotional Intensity Controllable Speech Synthesis With Disentangled Representations (2022)
- Emphasis Control For Parallel Neural TTS (2021)
- Controllable Prosody Generation With Partial Inputs (2023)
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025)
- Semi-supervised Generative Modeling For Controllable Speech Synthesis (2019)
- Word-level Emotional Expression Control In Zero-shot Text-to-speech Synthesis (2025)
- Prosodic Clustering For Phoneme-level Prosody Control In End-to-end Speech Synthesis (2021)