LLM Supervised Pre-training For Multimodal Emotion Recognition In Conversations
2025 · Soumya Dutta, Sriram Ganapathy
Abstract
Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of emotion expression. In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. These transcripts are obtained from a raw speech dataset with a pre-trained ASR system. A text LLM is queried to provide pseudo-labels for these transcripts, and the pseudo-labeled transcripts are subsequently used to learn an utterance-level text-based emotion recognition model. We use the utterance-level text embeddings for emotion recognition in conversations, along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the dataset. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, where we illustrate that the proposed model improves over other benchmarks and achi
Related papers
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022)
- BeMERC: Behavior-aware MLLM-based Framework For Multimodal Emotion Recognition In Conversation (2025)
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025)
- Jointly Fine-tuning "BERT-like" Self Supervised Models To Improve Multimodal Speech Emotion Recognition (2020)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Revise, Reason, And Recognize: LLM-based Emotion Recognition Via Emotion-specific Prompts And ASR Error Correction (2024)
- GatedxLSTM: A Multimodal Affective Computing Approach For Emotion Recognition In Conversations (2025)
- Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition (2023)