Is Style All You Need? Dependencies Between Emotion And Gst-based Speaker Recognition
2022 · Morgan Sandler, Arun Ross
Abstract
In this work, we study the hypothesis that speaker identity embeddings extracted from speech samples may be used for the detection and classification of emotion. In particular, we show that emotions can be effectively identified by learning speaker identities with a 1-D Triplet Convolutional Neural Network (CNN) and Global Style Token (GST) scheme (e.g., the DeepTalk network) and reusing the trained speaker recognition model weights to generate features in the emotion classification domain. The automatic speaker recognition (ASR) network is trained on the VoxCeleb1, VoxCeleb2, and Librispeech datasets with a triplet training loss function using speaker identity labels. Using a Support Vector Machine (SVM) classifier, we map speaker identity embeddings into discrete emotion categories from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. On the task of speech emotion detection, we obtain 80.8% ACC with acted emotion samples from CREMA-D, 81.2% ACC with semi-natural emotion samples in IEMOCAP,
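The triplet training loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance metric or margin, so the Euclidean distance, the `margin=0.2` default, and the names `euclidean` and `triplet_loss` below are assumptions made for illustration only; this is not the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, hypothetical margin).

    anchor and positive are embeddings of the same speaker,
    negative is an embedding of a different speaker. The loss is zero
    once the negative is at least `margin` farther from the anchor
    than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "embeddings" (made-up values, not real speaker data).
anchor = [0.0, 0.0]
same_speaker = [0.0, 1.0]    # distance 1 from the anchor
other_speaker = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: no penalty.
easy = triplet_loss(anchor, same_speaker, other_speaker)
# Swapped roles: the "positive" is farther than the "negative", so the loss is positive.
hard = triplet_loss(anchor, other_speaker, same_speaker)
```

Minimizing a loss of this shape over many triplets pulls same-speaker embeddings together and pushes different-speaker embeddings apart, which is the property the abstract relies on when it reuses the embeddings as features for emotion classification.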
Related papers
- Vocal Style Factorization For Effective Speaker Recognition In Affective Scenarios (2023)
- Identifying Speakers Using Their Emotion Cues (2018)
- Improving Speech Emotion Recognition With Unsupervised Speaking Style Transfer (2022)
- Emodiarize: Speaker Diarization And Emotion Identification From Speech Signals Using Convolutional Neural Networks (2023)
- End-to-end Emotional Speech Synthesis Using Style Tokens And Semi-supervised Training (2019)
- X-vectors Meet Emotions: A Study On Dependencies Between Emotion And Speaker Recognition (2020)
- SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition With Speaker Embedding And Vision Transformers (2022)
- Novel Cascaded Gaussian Mixture Model-deep Neural Network Classifier For Speaker Identification In Emotional Talking Environments (2018)