Emotech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information With Hybrid Recurrent Network
2025 Β· Shamin Bin Habib Avro, Taieba Taher, Nursadul Mamun
Abstract
Emotion recognition is a critical task in human-computer interaction, enabling more intuitive and responsive systems. This study presents a multimodal emotion recognition system that combines low-level information from audio and text, leveraging both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory Networks (BiLSTMs). The proposed system consists of two parallel networks: an Audio Block and a Text Block. Mel Frequency Cepstral Coefficients (MFCCs) are extracted and processed by a BiLSTM network and a 2D convolutional network to capture low-level intrinsic and extrinsic features from speech. Simultaneously, a combined BiLSTM-CNN network extracts the low-level sequential nature of text from word embeddings corresponding to the available audio. This low-level information from speech and text is then concatenated and processed by several fully connected layers to classify the speech emotion. Experimental results demonstrate that the proposed EmoTech accurately
Authors
(none)
Tags
Stats
Related papers
- Emoformer: A Text-independent Speech Emotion Recognition Using A Hybrid Transformer-cnn Model (2025)6.34
- Multimodal Speech Emotion Recognition Using Audio And Text (2018)18.02
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024)0.00
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021)9.23
- Emodiarize: Speaker Diarization And Emotion Identification From Speech Signals Using Convolutional Neural Networks (2023)0.00
- Speech Emotion Recognition Using Multi-hop Attention Mechanism (2019)14.58
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025)0.00
- Deep Learning Based Emotion Recognition System Using Speech Features And Transcriptions (2019)0.00