Improved Speech Emotion Recognition Using Transfer Learning And Spectrogram Augmentation
2021 · Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha, et al.
Abstract
Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a residual network (ResNet) model, including a statistics pooling layer, pre-trained for speaker recognition on large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation, which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique to generate additional training data samples by applying random time-frequency masks to lo
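The two mechanisms named in the abstract, statistics pooling over variable-length input and random time-frequency masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the mean-plus-standard-deviation form of the pooling, the mask widths, and the choice of zero as the fill value are all assumptions made for illustration.

```python
import numpy as np


def stats_pooling(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, feat_dim) array into one fixed-size vector.

    Assumed form: concatenate the per-dimension mean and standard deviation
    over time, so an utterance of any length maps to 2 * feat_dim values.
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def mask_spectrogram(spec, max_time_width=10, max_freq_width=4, rng=None):
    """Return a copy of a (num_frames, num_bins) spectrogram with one random
    band of frames and one random band of frequency bins set to zero.

    The maximum widths and the zero fill value are hypothetical defaults.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    num_frames, num_bins = out.shape

    # Time mask: a contiguous run of frames, width drawn uniformly.
    t_width = int(rng.integers(0, min(max_time_width, num_frames) + 1))
    t_start = int(rng.integers(0, num_frames - t_width + 1))
    out[t_start:t_start + t_width, :] = 0.0

    # Frequency mask: a contiguous run of bins, width drawn uniformly.
    f_width = int(rng.integers(0, min(max_freq_width, num_bins) + 1))
    f_start = int(rng.integers(0, num_bins - f_width + 1))
    out[:, f_start:f_start + f_width] = 0.0
    return out
```

Because the pooled vector has the same size for a 50-frame and a 300-frame input, no truncation or padding step is needed before a classifier. Calling `mask_spectrogram` several times on the same input yields differently masked copies that can serve as additional training samples.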
Related papers
- Continuous Metric Learning For Transferable Speech Emotion Recognition And Embedding Across Low-resource Languages (2022) · 0.00
- Towards Interpretable And Transferable Speech Emotion Recognition: Latent Representation Based Analysis Of Features, Methods And Corpora (2021) · 0.00
- Generative Data Augmentation Guided By Triplet Loss For Speech Emotion Recognition (2022) · 3.58
- Leveraged Mel Spectrograms Using Harmonic And Percussive Components In Speech Emotion Recognition (2023) · 9.03
- Speech Emotion Recognition With Multiscale Area Attention And Data Augmentation (2021) · 13.65
- Emonet: A Transfer Learning Framework For Multi-corpus Speech Emotion Recognition (2021) · 2.95
- Hybrid Data Augmentation And Deep Attention-based Dilated Convolutional-recurrent Neural Networks For Speech Emotion Recognition (2021) · 12.81
- Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, And Augmenting (2023) · 0.00