Speech Emotion Recognition With Distilled Prosodic And Linguistic Affect Representations
2023 · Debaditya Shome, Ali Etemad
Abstract
We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
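To make the idea of distilling at both the logit and embedding levels from two teachers concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the specific loss forms (temperature-softened KL divergence for logits, cosine distance for embeddings), the weights `alpha` and `beta`, the temperature value, and all function and field names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumed loss form; the abstract only says distillation happens at the logit level.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def embedding_kd_loss(student_emb, teacher_emb):
    """Cosine distance (1 - cosine similarity); an assumed embedding-level loss."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)

def cross_entropy(logits, label):
    """Standard cross-entropy of the student against the ground-truth label."""
    return -math.log(softmax(logits)[label])

def total_loss(student, teachers, label, alpha=1.0, beta=1.0, temperature=2.0):
    """Supervised loss plus logit- and embedding-level distillation from each teacher.

    `student["embs"][i]` is assumed to be the student's embedding projected into
    the space of `teachers[i]` (e.g. index 0 = prosodic, 1 = linguistic).
    """
    loss = cross_entropy(student["logits"], label)
    for student_emb, teacher in zip(student["embs"], teachers):
        loss += alpha * logit_kd_loss(student["logits"], teacher["logits"], temperature)
        loss += beta * embedding_kd_loss(student_emb, teacher["emb"])
    return loss

# Hypothetical 4-class example with a prosodic and a linguistic teacher.
student = {"logits": [2.0, 0.5, -1.0, 0.1], "embs": [[0.2, 0.9], [0.7, 0.1]]}
teachers = [
    {"logits": [2.5, 0.2, -1.2, 0.0], "emb": [0.1, 1.0]},  # prosodic teacher
    {"logits": [1.8, 0.9, -0.5, 0.3], "emb": [0.8, 0.2]},  # linguistic teacher
]
loss = total_loss(student, teachers, label=0)
```

In this sketch the teachers are used only to compute training losses, which mirrors the abstract's point that inference needs nothing but the speech-only student.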
Related papers
- Multi-teacher Language-aware Knowledge Distillation For Multilingual Speech Emotion Recognition (2025) 0.00
- Hierarchical Network With Decoupled Knowledge Distillation For Speech Emotion Recognition (2023) 6.77
- Multi-level Knowledge Distillation For Speech Emotion Recognition In Noisy Conditions (2023) 7.81
- Speech Emotion: Investigating Model Representations, Multi-task Learning And Knowledge Distillation (2022) 6.34
- Distilled HuBERT For Mobile Speech Emotion Recognition: A Cross-corpus Validation Study (2025) 0.00
- Leveraging Semantic Information For Efficient Self-supervised Emotion Recognition With Audio-textual Distilled Models (2023) 6.34
- Leveraging Content And Acoustic Representations For Speech Emotion Recognition (2024) 2.26
- SpeechEQ: Speech Emotion Recognition Based On Multi-scale Unified Datasets And Multitask Learning (2022) 5.84