Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition
2021 Β· Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, et al.
Abstract
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the pot
Authors
(none)
Tags
Stats
Related papers
- Fusion Approaches For Emotion Recognition From Speech Using Acoustic And Text-based Features (2024)12.25
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019)0.00
- MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition (2025)0.00
- Two-stage Dimensional Emotion Recognition By Fusing Predictions Of Acoustic And Text Networks Using SVM (2022)12.10
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024)4.52
- Emotech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information With Hybrid Recurrent Network (2025)8.35
- Jointly Fine-tuning "bert-like" Self Supervised Models To Improve Multimodal Speech Emotion Recognition (2020)13.74
- Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data (2022)11.85