MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, ASR Error Detection, And ASR Error Correction
2024 · Jiajun He, Xiaohan Shi, Xingfeng Li, et al.
Abstract
The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). A key issue with this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign weights to each word in ASR hypotheses. However, this approach has limited improvement potential because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion
Related papers
- Speech Emotion Recognition With ASR Transcripts: A Comprehensive Study On Word Error Rate And Fusion Techniques (2024) · 9.03
- Fusing ASR Outputs In Joint Training For Speech Emotion Recognition (2021) · 12.61
- ASR And Emotional Speech: A Word-level Investigation Of The Mutual Impact Of Speech And Emotion Recognition (2023) · 8.82
- MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition (2025) · 0.00
- On The Impact Of Word Error Rate On Acoustic-linguistic Speech Emotion Recognition: An Update For The Deep Learning Era (2021) · 0.00
- Multi-task Semi-supervised Adversarial Autoencoding For Speech Emotion Recognition (2019) · 14.58
- Metadata-enhanced Speech Emotion Recognition: Augmented Residual Integration And Co-attention In Two-stage Fine-tuning (2024) · 5.24
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) · 9.59