End-to-end Transfer Learning For Speaker-independent Cross-language And Cross-corpus Speech Emotion Recognition
2023 · Duowei Tang, Peter Kuppens, Lucca Geurts, et al.
Abstract
Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are often based on general acoustic features or end-to-end approaches, perform poorly when the testing set is in a different language from the training set or when the two sets are taken from different datasets. To alleviate these problems, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language and cross-corpus SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing language variability in the speech embeddings. Next, we propose a new Deep-Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other sources of variability, such as speaker variability and channel variability. Th
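For orientation, the classic (non-deep) within-class covariance normalisation that the Deep-WCCN name refers to can be sketched as follows. This is a minimal NumPy illustration of standard WCCN on fixed embeddings, not the paper's trainable layer. The function names, the `eps` ridge term, and the synthetic data are assumptions made for the example, and the choice of which labels define a "class" is left open.

```python
import numpy as np

def within_class_covariance(x, labels):
    """Average of the per-class covariance matrices (classic WCCN estimate)."""
    classes = np.unique(labels)
    dim = x.shape[1]
    w = np.zeros((dim, dim))
    for c in classes:
        xc = x[labels == c]
        centered = xc - xc.mean(axis=0)          # remove the class mean
        w += centered.T @ centered / len(xc)     # covariance of this class
    return w / len(classes)

def wccn_projection(x, labels, eps=1e-6):
    """Return B with B @ B.T = inv(W); rows of x @ B are the normalised embeddings."""
    w = within_class_covariance(x, labels)
    w += eps * np.eye(w.shape[0])                # small ridge for stability (assumed)
    return np.linalg.cholesky(np.linalg.inv(w))

# Synthetic embeddings: two classes sharing a correlated covariance (illustrative only).
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
labels = np.repeat([0, 1], 200)
x = rng.normal(size=(400, 3)) @ mix
x[labels == 1] += np.array([2.0, -1.0, 0.5])     # shift the second class mean

b = wccn_projection(x, labels)
w_after = within_class_covariance(x @ b, labels) # close to the identity matrix
```

After the projection, the within-class covariance is approximately the identity, so directions dominated by nuisance variation inside a class no longer outweigh the others.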
Related papers
- Transfer Learning For Improving Speech Emotion Classification Accuracy (2018)
- Continuous Metric Learning For Transferable Speech Emotion Recognition And Embedding Across Low-resource Languages (2022)
- EmoNet: A Transfer Learning Framework For Multi-corpus Speech Emotion Recognition (2021)
- Unsupervised Cross-lingual Speech Emotion Recognition Using Domain Adversarial Neural Network (2020)
- SigWavNet: Learning Multiresolution Signal Wavelet Network For Speech Emotion Recognition (2025)
- SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition With Speaker Embedding And Vision Transformers (2022)
- Supervised Contrastive Learning With Nearest Neighbor Search For Speech Emotion Recognition (2023)
- Cross-speaker Emotion Transfer Based On Speaker Condition Layer Normalization And Semi-supervised Training In Text-to-speech (2021)