Temporal-frequency State Space Duality: An Efficient Paradigm For Speech Emotion Recognition
2024 · Jiaqi Zhao, Fei Wang, Kun Li, et al.
Abstract
Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods are dominated by temporal-domain analysis and overlook the envelope structures of the frequency domain, which are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both the temporal and frequency dimensions. Concretely, we propose a temporal-frequency Mamba block to extract temporal- and frequency-aware emotional features, achieving an optimal balance between computational efficiency and model expressiveness. In addition, we design a Complex Metric-Distance Triplet (CMDT) loss to enable the model to capture representative emotional cues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba surpasses existing methods in terms of model size and latency, providing a more practical solution for futur
Related papers
- MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition (2025) 0.00
- Enhanced Speech Emotion Recognition With Efficient Channel Attention Guided Deep CNN-BiLSTM Framework (2024) 0.00
- Speech Emotion Recognition With Dual-sequence LSTM Architecture (2019) 15.78
- Optimizing Speech Emotion Recognition Using Manta-ray Based Feature Selection (2020) 0.00
- Transforming The Embeddings: A Lightweight Technique For Speech Emotion Recognition Tasks (2023) 7.50
- SpeechEQ: Speech Emotion Recognition Based On Multi-scale Unified Datasets And Multitask Learning (2022) 5.84
- Multi-channel Auto-encoder For Speech Emotion Recognition (2018) 0.00
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021) 9.23