Speech Emotion Recognition With Dual-sequence LSTM Architecture
2019 · Jianyou Wang, Michael Xue, Ryan Culhane, et al.
Abstract
Speech Emotion Recognition (SER) has emerged as a critical component of next-generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs of the two models are then averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.
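The fusion and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each branch (the LSTM over MFCC features and the DS-LSTM over the two mel-spectrograms) emits a per-class probability vector, and that the two metrics follow the common SER convention: weighted accuracy is overall accuracy, and unweighted accuracy is the mean of per-class recalls. The function names, the three-class setup, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_fusion(probs_a, probs_b):
    """Element-wise average of two per-class probability vectors."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both branches must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def predict(probs):
    """Index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: each utterance counts equally (assumed convention)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally (assumed convention)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy outputs for four utterances and three hypothetical emotion classes.
lstm_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
ds_lstm_probs = [[0.4, 0.4, 0.2], [0.1, 0.7, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]
labels = [0, 1, 0, 2]

# Fuse the two branches per utterance, then take the most likely class.
fused = [average_fusion(a, b) for a, b in zip(lstm_probs, ds_lstm_probs)]
preds = [predict(p) for p in fused]

wa = weighted_accuracy(labels, preds)
ua = unweighted_accuracy(labels, preds)
print(preds, wa, ua)
```

In this toy case the third utterance is misclassified after fusion, so the two metrics diverge: the error costs a quarter of the utterances under weighted accuracy but half of one class's recall under unweighted accuracy. This is why SER papers typically report both on class-imbalanced data.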
Related papers
- Improvement And Implementation Of A Speech Emotion Recognition Model Based On Dual-layer LSTM (2024) 0.00
- Multilingual Speech Emotion Recognition With Multi-gating Mechanism And Neural Architecture Search (2022) 2.26
- Leveraged Mel Spectrograms Using Harmonic And Percussive Components In Speech Emotion Recognition (2023) 9.03
- Two-stage Dimensional Emotion Recognition By Fusing Predictions Of Acoustic And Text Networks Using SVM (2022) 12.10
- Enhanced Speech Emotion Recognition With Efficient Channel Attention Guided Deep CNN-BiLSTM Framework (2024) 0.00
- Improved Speech Emotion Recognition Using Transfer Learning And Spectrogram Augmentation (2021) 12.74
- Hybrid Data Augmentation And Deep Attention-based Dilated Convolutional-recurrent Neural Networks For Speech Emotion Recognition (2021) 12.81
- Speech Emotion Recognition With Co-attention Based Multi-level Acoustic Information (2022) 16.17