MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition
2025 · Haoxun Li, Yuqing Sun, Hanlei Shi, et al.
Abstract
Continuous dimensional speech emotion recognition captures affective variation along valence, arousal, and dominance, providing finer-grained representations than categorical approaches. Yet most multimodal methods rely solely on global transcripts, leading to two limitations: (1) all words are treated equally, overlooking that emphasis on different parts of a sentence can shift emotional meaning; (2) only surface lexical content is represented, lacking higher-level interpretive cues. To overcome these issues, we propose MSF-SER (Multi-granularity Semantic Fusion for Speech Emotion Recognition), which augments acoustic features with three complementary levels of textual semantics--Local Emphasized Semantics (LES), Global Semantics (GS), and Extended Semantics (ES). These are integrated via an intra-modal gated fusion and a cross-modal FiLM-modulated lightweight Mixture-of-Experts (FM-MOE). Experiments on MSP-Podcast and IEMOCAP show that MSF-SER consistently improves dimensional predic
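The two fusion stages named in the abstract can be illustrated with a small NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the embedding size, the random weights, the softmax form of the gate, the placement of FiLM before the experts, and all function names are hypothetical. It only shows the general shape of the idea: a gate weights the three semantic levels (LES, GS, ES), the fused text vector scales and shifts the acoustic features in FiLM style, and a softmax-routed mixture of linear experts produces valence, arousal, and dominance.

```python
import numpy as np

D, N_EXPERTS, N_DIMS = 8, 4, 3  # hypothetical sizes; N_DIMS = valence, arousal, dominance


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def gated_fusion(les, gs, es, w_gate):
    """Intra-modal fusion: one scalar gate per semantic level (assumed form)."""
    stacked = np.stack([les, gs, es])      # (3, D)
    weights = softmax(stacked @ w_gate)    # (3,), sums to 1
    return weights @ stacked, weights      # fused text vector (D,), gate weights


def film(acoustic, text, w_gamma, w_beta):
    """FiLM: text-conditioned scale (gamma) and shift (beta) of acoustic features."""
    gamma = text @ w_gamma                 # (D,)
    beta = text @ w_beta                   # (D,)
    return gamma * acoustic + beta


def mixture_of_experts(x, w_router, expert_ws):
    """Lightweight MoE: softmax router over linear experts (assumed form)."""
    route = softmax(x @ w_router)                      # (N_EXPERTS,)
    outputs = np.stack([x @ w for w in expert_ws])     # (N_EXPERTS, N_DIMS)
    return route @ outputs                             # (N_DIMS,)


def predict(acoustic, les, gs, es, params):
    """Chain the stages: gated text fusion, FiLM modulation, expert mixture."""
    text, weights = gated_fusion(les, gs, es, params["w_gate"])
    modulated = film(acoustic, text, params["w_gamma"], params["w_beta"])
    vad = mixture_of_experts(modulated, params["w_router"], params["experts"])
    return vad, weights


def random_params(rng):
    """Random placeholder weights; a trained model would learn these."""
    return {
        "w_gate": rng.normal(size=D),
        "w_gamma": rng.normal(size=(D, D)),
        "w_beta": rng.normal(size=(D, D)),
        "w_router": rng.normal(size=(D, N_EXPERTS)),
        "experts": [rng.normal(size=(D, N_DIMS)) for _ in range(N_EXPERTS)],
    }


rng = np.random.default_rng(0)
params = random_params(rng)
acoustic, les, gs, es = (rng.normal(size=D) for _ in range(4))
vad, gate_weights = predict(acoustic, les, gs, es, params)
print("gate weights (LES, GS, ES):", gate_weights)
print("valence, arousal, dominance:", vad)
```

In this sketch the gate weights make it easy to inspect how much each semantic level contributes for a given utterance; with random weights the numbers carry no meaning.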
Related papers
- MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition (2023) 0.00
- Multistage Linguistic Conditioning of Convolutional Layers for Speech Emotion Recognition (2021) 9.23
- MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction (2024) 0.00
- Temporal-Frequency State Space Duality: An Efficient Paradigm for Speech Emotion Recognition (2024) 7.50
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis (2022) 13.97
- SpeechEQ: Speech Emotion Recognition Based on Multi-scale Unified Datasets and Multitask Learning (2022) 5.84
- Semantic Matters: Multimodal Features for Affective Analysis (2025) 0.00
- MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-supervised Learning for Speech Emotion Recognition (2024) 5.84