Whose Emotion Matters? Speaking Activity Localisation Without Prior Knowledge
2022 · Hugo Carneiro, Cornelius Weber, Stefan Wermter
Abstract
The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this. First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires the localisation of the utterance source. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). By using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions from speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos more cl
Related papers
- Quality-controlled Multimodal Emotion Recognition In Conversations With Identity-based Transfer Learning And MAMBA Fusion (2025)
- MIAR: Modality Interaction And Alignment Representation Fuison For Multimodal Emotion (2026)
- Bemerc: Behavior-aware MLLM-based Framework For Multimodal Emotion Recognition In Conversation (2025)
- Beyond Silent Letters: Amplifying LLMs In Emotion Recognition With Vocal Nuances (2024)
- Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition (2024)
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019)
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022)
- Semantic Matters: Multimodal Features For Affective Analysis (2025)