Learning Alignment For Multimodal Emotion Recognition From Speech
2019 · Haiyang Xu, Hui Zhang, Kun Han, et al.
Abstract
Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Although emotion recognition benefits from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on
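The alignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not state the scoring function, feature dimensions, or whether learned projections are used, so the scaled dot-product score, the shared 8-dimensional feature size, the random inputs, and the function name `align_speech_to_words` are all assumptions made for illustration. The sketch only shows how each word can attend over all speech frames to obtain a word-level speech summary, which is then concatenated with the word vector.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def align_speech_to_words(speech_frames, word_vectors):
    """Soft-align speech frames to words with dot-product attention.

    speech_frames: array of shape (num_frames, dim)
    word_vectors:  array of shape (num_words, dim)

    Returns (weights, fused):
      weights: (num_words, num_frames), each row sums to 1
      fused:   (num_words, 2 * dim), each word vector concatenated with
               its attention-weighted summary of the speech frames
    """
    dim = word_vectors.shape[1]
    # Assumption: scaled dot-product scoring; the abstract does not specify one.
    scores = word_vectors @ speech_frames.T / np.sqrt(dim)
    # One attention distribution over speech frames per word.
    weights = softmax(scores, axis=1)
    # Word-level speech representation: weighted average of the frames.
    aligned_speech = weights @ speech_frames
    fused = np.concatenate([word_vectors, aligned_speech], axis=1)
    return weights, fused


# Hypothetical inputs: 50 speech frames and 6 words, both with 8-dim features.
rng = np.random.default_rng(0)
speech_frames = rng.normal(size=(50, 8))
word_vectors = rng.normal(size=(6, 8))

weights, fused = align_speech_to_words(speech_frames, word_vectors)
print(weights.shape, fused.shape)  # (6, 50) (6, 16)
```

In a full system, the `fused` sequence (one row per word) would be the input to the sequential model mentioned in the abstract. By contrast, decision-level fusion would only combine the final predictions of two separate models, so no word-to-frame weights like `weights` would exist.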
Related papers
- Multimodal Speech Emotion Recognition And Ambiguity Resolution (2019) 0.00
- Multimodal Speech Emotion Recognition Using Audio And Text (2018) 18.02
- Contrastive Regularization For Multimodal Emotion Recognition Using Audio And Text (2022) 0.00
- Speech Emotion Recognition Using Multi-hop Attention Mechanism (2019) 14.58
- Group Gated Fusion On Attention-based Bidirectional Alignment For Multimodal Emotion Recognition (2022) 11.39
- Multimodal Speech Emotion Recognition Using Cross Attention With Aligned Audio And Text (2022) 9.76
- Multimodal Emotion Recognition Using Transfer Learning From Speaker Recognition And Bert-based Models (2022) 12.10
- Agent-based Modular Learning For Multimodal Emotion Recognition In Human-agent Systems (2025) 0.00