Representation Learning Through Cross-modal Conditional Teacher-student Training For Speech Emotion Recognition
2021 · Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff
Abstract
Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence while the differences in predicting activation and dominance dimensions are less pronounced. However, we show that even the best-performing HuBERT representation underperforms on valence prediction compared to a multimodal model that also incorporates text representation. We address this shortcoming by injecting lexical information into the speech representation using the multimodal model as a teacher. To improve the efficacy of our approach, we propose a novel estimate of the quality of the emotio
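The teacher-student idea in the abstract can be illustrated with a minimal sketch. It assumes a speech-only student and a multimodal (speech plus text) teacher that each output three scores (valence, activation, dominance), and it uses a plain mean-squared-error objective. The function names, the `alpha` weight, and the choice of MSE are illustrative assumptions, not the authors' formulation; the quality estimate that the abstract starts to describe is cut off in the source and is not reproduced here.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length score vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    label: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical combined objective for the speech-only student.

    alpha weights the supervised term (student vs. human labels);
    (1 - alpha) weights the distillation term (student vs. teacher).
    """
    supervised = mse(student, label)
    distill = mse(student, teacher)
    return alpha * supervised + (1.0 - alpha) * distill


# Example scores in the order (valence, activation, dominance).
student_pred = [1.0, 0.0, 0.0]
teacher_pred = [0.0, 0.0, 0.0]
human_label = [1.0, 0.0, 0.0]
loss = distillation_loss(student_pred, teacher_pred, human_label, alpha=0.5)
```

In this toy case the student matches the label exactly, so only the distillation term contributes to the loss. In practice the student would be a neural network trained by gradient descent on such an objective over many utterances.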
Related papers
- Pre-trained Model Representations And Their Robustness Against Noise For Speech Emotion Analysis (2023) 0.00
- Speech Emotion: Investigating Model Representations, Multi-task Learning And Knowledge Distillation (2022) 6.34
- Effect Of Attention And Self-supervised Speech Embeddings On Non-semantic Speech Tasks (2023) 4.52
- Investigating Salient Representations And Label Variance In Dimensional Speech Emotion Analysis (2023) 3.58
- Learning Representations Of Emotional Speech With Deep Convolutional Generative Adversarial Networks (2017) 0.00
- Emotion2vec: Self-supervised Pre-training For Speech Emotion Representation (2023) 15.88
- Jointly Fine-tuning "BERT-like" Self Supervised Models To Improve Multimodal Speech Emotion Recognition (2020) 13.74
- Attention-augmented End-to-end Multi-task Learning For Emotion Prediction From Speech (2019) 13.50