Revealing Emotional Clusters In Speaker Embeddings: A Contrastive Learning Strategy For Speech Emotion Recognition
2024 Β· Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, et al.
Abstract
Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach involves the sampling of positive and the negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significa
Authors
(none)
Tags
Stats
Related papers
- Emotion Invariant Speaker Embeddings For Speaker Identification With Emotional Speech (2020)0.00
- Transforming The Embeddings: A Lightweight Technique For Speech Emotion Recognition Tasks (2023)7.50
- Unsupervised Representations Improve Supervised Learning In Speech Emotion Recognition (2023)0.00
- Supervised Contrastive Learning With Nearest Neighbor Search For Speech Emotion Recognition (2023)7.16
- Towards Interpretable And Transferable Speech Emotion Recognition: Latent Representation Based Analysis Of Features, Methods And Corpora (2021)0.00
- A Comparative Study Of Pre-trained Speech And Audio Embeddings For Speech Emotion Recognition (2023)0.00
- Exploring Self-supervised Multi-view Contrastive Learning For Speech Emotion Recognition With Limited Annotations (2024)3.58
- Trustser: On The Trustworthiness Of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition (2023)9.07