End-to-end Transfer Learning For Speaker-independent Cross-language And Cross-corpus Speech Emotion Recognition

Abstract

Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are often based on general acoustic features or end-to-end approaches, show poor performance when the testing set has a different language than the training set or when these sets are taken from different datasets. To alleviate these problems, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language and cross-corpus SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing the language variabilities in the speech embeddings. Next, we propose a new Deep-Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other variabilities including speaker variability, channel variability and so on. Th

End-to-end Transfer Learning For Speaker-independent Cross-language And Cross-corpus Speech Emotion Recognition

Abstract

Authors

Tags

Stats

Related papers