Learning Local To Global Feature Aggregation For Speech Emotion Recognition
2023 · Cheng Lu, Hailun Lian, Wenming Zheng, et al.
Abstract
The Transformer has recently emerged in speech emotion recognition (SER). However, its equal patch division not only damages frequency information but also ignores local emotion correlations across frames, which are key cues for representing emotion. To address this issue, we propose Local to Global Feature Aggregation learning (LGFA) for SER, which aggregates long-term emotion correlations at different scales, both inside frames and inside segments, while keeping the entire frequency information, in order to enhance the emotion discrimination of utterance-level speech features. For this purpose, we nest a Frame Transformer inside a Segment Transformer. First, the Frame Transformer is designed to excavate local emotion correlations between frames and produce frame embeddings. Then, the frame embeddings and their corresponding segment features are aggregated as complementary information from different levels and fed into the Segment Transformer to learn utterance-level global emotion features. Experimental results show that the performance of LGFA i
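The nesting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the aggregation operator, the embedding size, or the segment length, so the additive fusion, mean pooling, `frames_per_segment`, `d_model`, and the random untrained projections below are all assumptions chosen only to show the data flow. Each frame keeps the whole frequency axis, frames attend to each other inside a segment, and the resulting frame embeddings are combined with a segment feature before attention across segments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, rng, d_model):
    """Single-head self-attention over the second-to-last axis.
    Projections are random and untrained (illustration only)."""
    d_in = x.shape[-1]
    wq, wk, wv = (rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
                  for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_model)
    return softmax(scores, axis=-1) @ v

def lgfa_sketch(spec, frames_per_segment=4, d_model=16, seed=0):
    """spec: (n_freq, n_frames) spectrogram. Returns an utterance-level vector.
    All hyperparameters and the fusion rule are hypothetical."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = spec.shape
    n_seg = n_frames // frames_per_segment
    used = n_seg * frames_per_segment  # drop trailing frames that do not fill a segment

    # Each frame is one token that keeps the entire frequency axis.
    frames = spec[:, :used].T.reshape(n_seg, frames_per_segment, n_freq)

    # Frame level: attention between frames inside each segment.
    frame_emb = self_attention(frames, rng, d_model)      # (n_seg, fps, d_model)

    # Segment feature: linear projection of the flattened segment (assumption).
    seg_raw = frames.reshape(n_seg, frames_per_segment * n_freq)
    w_seg = rng.standard_normal((seg_raw.shape[1], d_model)) / np.sqrt(seg_raw.shape[1])
    seg_feat = seg_raw @ w_seg                            # (n_seg, d_model)

    # Aggregation: add pooled frame embeddings to the segment feature (assumption).
    seg_tokens = seg_feat + frame_emb.mean(axis=1)

    # Segment level: attention across segments, then pool to one utterance vector.
    seg_emb = self_attention(seg_tokens, rng, d_model)    # (n_seg, d_model)
    return seg_emb.mean(axis=0)

spec = np.random.default_rng(1).standard_normal((26, 18))  # toy 26-bin, 18-frame input
utt = lgfa_sketch(spec)
print(utt.shape)
```

In a trained model the random matrices would be learned parameters and each attention step would sit inside a full Transformer block; the sketch only shows how frame-level tokens feed the segment-level stage.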
Related papers
- Deep Residual Local Feature Learning For Speech Emotion Recognition (2020) 7.16
- Time-frequency Transformer: A Novel Time Frequency Joint Learning Method For Speech Emotion Recognition (2023) 5.84
- Speech Emotion Recognition With Global-aware Fusion On Multi-scale Feature Representation (2022) 16.53
- Semi-supervised Cross-lingual Speech Emotion Recognition (2022) 10.85
- Probing Speech Emotion Recognition Transformers For Linguistic Knowledge (2022) 9.59
- Speech Emotion Recognition Via CNN-Transformer And Multidimensional Attention Mechanism (2024) 0.00
- Leveraging Cross-attention Transformer And Multi-feature Fusion For Cross-linguistic Speech Emotion Recognition (2025) 4.52
- Frame-level Emotional State Alignment Method For Speech Emotion Recognition (2023) 8.60