Multimodal Speech Emotion Recognition Using Audio And Text
2018 · Seunghyun Yoon, Seokhyun Byun, Kyomin Jung
Abstract
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion
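The dual-encoder idea described in the abstract can be illustrated with a minimal sketch: one recurrent encoder reads the audio feature sequence, a second reads the text sequence, and the two final hidden states are combined before a four-way classifier. Everything below is an assumption made for illustration rather than the authors' implementation: the plain tanh recurrent cell, fusion by concatenation, the feature and hidden sizes, the random untrained weights, and the helper names `init_rnn`, `encode`, and `predict`.

```python
import numpy as np

NUM_CLASSES = 4  # the abstract mentions four emotion classes


def init_rnn(rng, input_dim, hidden_dim):
    """Random parameters for a plain tanh RNN (a stand-in cell, assumed for illustration)."""
    return {
        "W": rng.normal(0.0, 0.1, (hidden_dim, input_dim)),   # input-to-hidden weights
        "U": rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),  # hidden-to-hidden weights
        "b": np.zeros(hidden_dim),
    }


def encode(params, sequence):
    """Run the RNN over a (time, input_dim) array and return the final hidden state."""
    h = np.zeros(params["b"].shape[0])
    for x_t in sequence:
        h = np.tanh(params["W"] @ x_t + params["U"] @ h + params["b"])
    return h


def predict(audio_rnn, text_rnn, classifier, audio_frames, text_embeddings):
    """Encode each modality separately, concatenate the encodings, apply a softmax layer."""
    fused = np.concatenate([
        encode(audio_rnn, audio_frames),
        encode(text_rnn, text_embeddings),
    ])  # concatenation is an assumed fusion step
    logits = classifier["W"] @ fused + classifier["b"]
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


rng = np.random.default_rng(0)
AUDIO_DIM, TEXT_DIM, HIDDEN = 39, 50, 16  # illustrative sizes, not taken from the paper

audio_rnn = init_rnn(rng, AUDIO_DIM, HIDDEN)
text_rnn = init_rnn(rng, TEXT_DIM, HIDDEN)
classifier = {
    "W": rng.normal(0.0, 0.1, (NUM_CLASSES, 2 * HIDDEN)),
    "b": np.zeros(NUM_CLASSES),
}

audio_frames = rng.normal(size=(120, AUDIO_DIM))   # stand-in for 120 acoustic feature frames
text_embeddings = rng.normal(size=(12, TEXT_DIM))  # stand-in for 12 word vectors

probs = predict(audio_rnn, text_rnn, classifier, audio_frames, text_embeddings)
print(probs.shape, float(probs.sum()))
```

With untrained weights the output probabilities carry no meaning; the sketch only shows how two sequences of different lengths and feature sizes can be reduced to fixed-size vectors and classified jointly.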
Related papers
- Contrastive Regularization For Multimodal Emotion Recognition Using Audio And Text (2022) 0.00
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019) 15.22
- Multimodal Speech Emotion Recognition And Ambiguity Resolution (2019) 0.00
- Emotech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information With Hybrid Recurrent Network (2025) 8.35
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019) 0.00
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025) 0.00
- Speech Emotion Recognition Using Multi-hop Attention Mechanism (2019) 14.58
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) 0.00