Abstract

In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, EmoSet contains 84181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter efficient training of a multi-domain SER model on all 26 corpora. A shared model with only \(3.5\) times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 co

Authors

(none)

Tags

  • Speech Recognition

Stats

  • citations0
  • S2 citationsβ€”
  • github stars29
  • HF likes0
  • heat score2.95
  • arxiv keygerczuk2021emonet

Related papers