Visualization And Interpretation Of Latent Spaces For Controlling Expressive Speech Synthesis Through Audio Analysis
2019 · Noé Tits, Fengna Wang, Kevin El Haddad, et al.
Abstract
The field of Text-to-Speech has improved greatly in recent years thanks to deep learning techniques, and producing realistic speech is now possible. As a consequence, research on controlling expressiveness, which allows speech to be generated in different styles or manners, has attracted increasing attention. Systems able to control style have been developed and show impressive results. However, the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will make it possible to build controllable speech synthesis systems with understandable behaviour.
Related papers
- Learning Utterance-level Representations Through Token-level Acoustic Latents Prediction For Expressive Speech Synthesis (2022)
- The Theory Behind Controllable Expressive Speech Synthesis: A Cross-disciplinary Approach (2019)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)
- A Study On Altering The Latent Space Of Pretrained Text To Speech Models For Improved Expressiveness (2023)
- Analysis And Assessment Of Controllability Of An Expressive Deep Learning-based TTS System (2021)
- Towards Controllable Speech Synthesis In The Era Of Large Language Models: A Systematic Survey (2024)
- Semi-supervised Learning For Continuous Emotional Intensity Controllable Speech Synthesis With Disentangled Representations (2022)
- Hierarchical Multi-grained Generative Model For Expressive Speech Synthesis (2020)