Emotion Controllable Speech Synthesis Using Emotion-unlabeled Dataset With The Assistance Of Cross-domain Speech Emotion Recognition
2020 · Xiong Cai, Dongyang Dai, Zhiyong Wu, et al.
Abstract
Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, and it is difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both SER and TTS datasets. Then, we use the emotion labels that the trained SER model predicts for the TTS dataset to build an auxiliary SER task and train it jointly with the TTS model. Experimental results show that our proposed method can generate speech with the specified emotional expressiveness while causing almost no degradation in speech quality.
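The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the emotion set, the function names, the `aux_weight` parameter, and the additive form of the joint loss are all assumptions made for illustration, since the abstract only states that predicted labels feed an auxiliary SER task trained jointly with the TTS model.

```python
import math

# Hypothetical emotion inventory; the abstract does not list the classes.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(ser_logits):
    """Stage 1 output: index of the emotion the SER model finds most likely."""
    probs = softmax(ser_logits)
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[target])


def joint_loss(tts_loss, aux_ser_logits, pseudo_target, aux_weight=1.0):
    """Stage 2 (assumed form): TTS loss plus a weighted auxiliary SER loss
    computed against the pseudo label from stage 1."""
    return tts_loss + aux_weight * cross_entropy(aux_ser_logits, pseudo_target)


# Toy values standing in for real model outputs on one TTS utterance.
ser_logits = [0.1, 2.0, -1.0, 0.3]
label = pseudo_label(ser_logits)
loss = joint_loss(
    tts_loss=0.5,
    aux_ser_logits=[0.0, 0.0, 0.0, 0.0],
    pseudo_target=label,
    aux_weight=0.1,
)
print(EMOTIONS[label], loss)
```

In a real system the toy logits would come from the trained cross-domain SER model and from an emotion classifier attached to the TTS model, and the weighting between the two losses would need tuning.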
Related papers
- ED-TTS: Multi-scale Emotion Modeling Using Cross-domain Emotion Diarization For Emotional Speech Synthesis (2024)
- Generative Emotional AI For Speech Emotion Recognition: The Case For Synthetic Emotional Speech Augmentation (2023)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)
- Semi-supervised Cross-lingual Speech Emotion Recognition (2022)
- Emospeech: A Corpus Of Emotionally Rich And Contextually Detailed Speech Annotations (2024)
- Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition (2023)
- Exploring Speech Style Spaces With Language Models: Emotional TTS Without Emotion Labels (2024)
- EMOVIE: A Mandarin Emotion Speech Dataset With A Simple Emotional Text-to-speech Model (2021)