Fusion Approaches For Emotion Recognition From Speech Using Acoustic And Text-based Features
2024 · Leonardo Pepino, Pablo Riera, Luciana Ferrer, et al.
Abstract
In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating tra
Related papers
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021) 9.23
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) 4.52
- Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data (2022) 11.85
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) 0.00
- Multimodal Speech Emotion Recognition And Ambiguity Resolution (2019) 0.00
- Deep Learning Based Emotion Recognition System Using Speech Features And Transcriptions (2019) 0.00
- Fusing ASR Outputs In Joint Training For Speech Emotion Recognition (2021) 12.61