Framewise Approach In Multimodal Emotion Recognition In OMG Challenge
2018 · Grigoriy Sterling, Andrey Belyaev, Maxim Ryabov
Abstract
In this report we describe our approach, which achieves \(53\%\) unweighted accuracy over \(7\) emotions and mean squared errors of \(0.05\) and \(0.09\) for arousal and valence, respectively, in the OMG emotion recognition challenge. Our results were obtained with an ensemble of single-modality models trained separately on the voice and face data from the videos. We treat each stream as a sequence of frames, extract features from each frame, and process the resulting sequence with a recurrent neural network. An audio frame is a short \(0.4\)-second spectrogram interval. To extract features from face images we used our own ResNet, pretrained on the AffectNet database. Each short spectrogram was treated as an image and likewise processed by a convolutional network; as the base audio model we used a ResNet pretrained on a speaker recognition task. Predictions from the two modalities were fused at the decision level, which improved on the single-channel approaches by a few percent.
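The abstract does not state which rule was used for decision-level fusion, so the following is only a minimal sketch of one common choice: a weighted average of the per-class probabilities from the audio and face models, with the same weighting applied to the arousal and valence regression outputs. The function names, the `audio_weight` parameter, and the example probability vectors are all hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def fuse_probabilities(audio_probs: Sequence[float],
                       face_probs: Sequence[float],
                       audio_weight: float = 0.5) -> List[float]:
    """Late fusion by weighted averaging (an assumed rule, not the authors')."""
    if len(audio_probs) != len(face_probs):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    face_weight = 1.0 - audio_weight
    # Combine the two models class by class.
    return [audio_weight * a + face_weight * f
            for a, f in zip(audio_probs, face_probs)]


def fuse_regression(audio_value: float,
                    face_value: float,
                    audio_weight: float = 0.5) -> float:
    """Apply the same weighting to a scalar output such as arousal or valence."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_value + (1.0 - audio_weight) * face_value


def predict_emotion(fused_probs: Sequence[float]) -> int:
    """Return the index of the most probable emotion class."""
    return max(range(len(fused_probs)), key=lambda i: fused_probs[i])


if __name__ == "__main__":
    # Made-up scores over 7 emotion classes, for illustration only.
    audio = [0.10, 0.05, 0.05, 0.40, 0.10, 0.20, 0.10]
    face = [0.05, 0.05, 0.10, 0.20, 0.10, 0.45, 0.05]
    fused = fuse_probabilities(audio, face)
    print(predict_emotion(fused), fuse_regression(0.2, 0.4))
```

In practice the weight would be tuned on a validation set; with equal weights the fused prediction can differ from either single-modality prediction when the two models disagree.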
Related papers
- Continuous Multimodal Emotion Recognition Approach For AVEC 2017 (2017)
- Transformer For Emotion Recognition (2018)
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019)
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025)
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020)
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022)
- Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition (2024)
- Effmulti: Efficiently Modeling Complex Multimodal Interactions For Emotion Analysis (2022)