Speech Emotion Recognition With ASR Transcripts: A Comprehensive Study On Word Error Rate And Fusion Techniques
2024 · Yuanchao Li, Peter Bell, Catherine Lai
Abstract
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. These findings provide insights into
Related papers
- On The Impact Of Word Error Rate On Acoustic-linguistic Speech Emotion Recognition: An Update For The Deep Learning Era (2021)
- ASR And Emotional Speech: A Word-level Investigation Of The Mutual Impact Of Speech And Emotion Recognition (2023)
- MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, ASR Error Detection, And ASR Error Correction (2024)
- Fusing ASR Outputs In Joint Training For Speech Emotion Recognition (2021)
- Cross-modal ASR Post-processing System For Error Correction And Utterance Rejection (2022)
- A Change Of Heart: Improving Speech Emotion Recognition Through Speech-to-text Modality Conversion (2023)
- On The Efficacy And Noise-robustness Of Jointly Learned Speech Emotion And Automatic Speech Recognition (2023)
- Automatic Speech Recognition System-independent Word Error Rate Estimation (2024)