Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data
2022 · Puneet Kumar, Sarthak Malik, Balasubramanian Raman
Abstract
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech and image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determined through intensive ablation studies. It fuses the speech and image features and then combines speech, image, and intermediate fusion outputs. The proposed interpretability technique incorporates the divide-and-conquer approach to compute Shapley values denoting each speech and image feature's importance. We have also constructed a large-scale dataset (IIT-R SIER dataset), consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has achieved 83.29% accuracy for emotion recognition. The enhanced performance of the propos
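As background for the Shapley values mentioned in the abstract, here is a minimal sketch of the textbook exact computation by subset enumeration. It is not the paper's divide-and-conquer technique, which the abstract does not describe. The names `shapley_values`, `toy_score`, and `WEIGHTS` are hypothetical, and the toy scoring function merely stands in for a model's output on a subset of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float.

    Cost grows as 2^n_features, so this is only practical for a handful
    of features.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that does not contain i.
            weight = (factorial(k) * factorial(n_features - k - 1)
                      / factorial(n_features))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy "model": an additive score plus one interaction term
# between features 0 and 1 (an assumption made only for illustration).
WEIGHTS = [0.5, 0.3, 0.2]


def toy_score(present):
    score = sum(WEIGHTS[j] for j in present)
    if 0 in present and 1 in present:
        score += 0.1
    return score


phi = shapley_values(toy_score, 3)
print(phi)
```

In this toy setup the interaction bonus is split evenly between features 0 and 1, and the values sum to the difference between the score with all features present and the score with none, which is the efficiency property that makes Shapley values attractive as importance scores.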
Related papers
- Fusion Approaches For Emotion Recognition From Speech Using Acoustic And Text-based Features (2024) 12.25
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) 0.00
- Multimodal Speech Emotion Recognition And Ambiguity Resolution (2019) 0.00
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) 4.52
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) 8.09
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021) 9.23
- Effmulti: Efficiently Modeling Complex Multimodal Interactions For Emotion Analysis (2022) 0.00
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019) 0.00