On The Use Of Self-supervised Pre-trained Acoustic And Linguistic Features For Continuous Speech Emotion Recognition
2020 · Manon Macary, Marie Tahon, Yannick Estève, et al.
Abstract
Pre-training for feature extraction is an increasingly studied approach to get better continuous representations of audio and text content. In the present work, we use wav2vec and CamemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state-of-the-art corpus SEWA, which focuses on the valence, arousal and liking dimensions. To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant for the continuous SER task, which is usually characterized by a small amount of labeled training data. Evaluated with the well-known concordance correlation coefficient (CCC), our experiments reach a CCC value of 0.825 on the AlloSat dataset, compared with 0.592 when using MFCC in conjunction with word2vec word embeddings.
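The abstract reports its results as CCC values. As a minimal sketch of how that metric is usually computed, the function below follows the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (biased) variance and covariance. The function name `concordance_cc` and the example values are illustrative assumptions, not taken from the paper, and the authors' own evaluation code may differ (for example in how sequences are concatenated before scoring).

```python
import numpy as np


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    Hypothetical helper for illustration; uses population statistics
    (ddof=0), which is an assumption about the convention in use.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    if pred.ndim != 1 or pred.shape != gold.shape:
        raise ValueError("pred and gold must be 1-D arrays of equal length")

    mean_p, mean_g = pred.mean(), gold.mean()
    # Population covariance between prediction and reference.
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    # Denominator penalizes both scale differences and a shift in the means.
    denom = pred.var() + gold.var() + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: CCC is undefined.
        return float("nan")
    return float(2.0 * cov / denom)


if __name__ == "__main__":
    # Made-up annotation values, for illustration only.
    reference = [0.1, 0.4, 0.35, 0.8]
    prediction = [0.2, 0.5, 0.30, 0.7]
    print(concordance_cc(prediction, reference))
```

Unlike the Pearson correlation, CCC drops below 1 when predictions are perfectly correlated with the reference but offset or rescaled, which is why it is commonly preferred for continuous emotion annotation.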
Related papers
- Unsupervised Representations Improve Supervised Learning In Speech Emotion Recognition (2023)
- Speaker Emotion Recognition: Leveraging Self-supervised Models For Feature Extraction Using Wav2vec2 And Hubert (2024)
- Continuous Metric Learning For Transferable Speech Emotion Recognition And Embedding Across Low-resource Languages (2022)
- Leveraging Content And Acoustic Representations For Speech Emotion Recognition (2024)
- Supervised Contrastive Learning With Nearest Neighbor Search For Speech Emotion Recognition (2023)
- End-to-end Integration Of Speech Emotion Recognition With Voice Activity Detection Using Self-supervised Learning Features (2024)
- Towards Interpretable And Transferable Speech Emotion Recognition: Latent Representation Based Analysis Of Features, Methods And Corpora (2021)
- Dawn Of The Transformer Era In Speech Emotion Recognition: Closing The Valence Gap (2022)