Speech Swin-transformer: Exploring A Hierarchical Transformer With Shifted Windows For Speech Emotion Recognition
2024 · Yong Wang, Cheng Lu, Hailun Lian, et al.
Abstract
Swin-Transformer has demonstrated remarkable success in computer vision by leveraging a hierarchical feature representation built on the Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this observation, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, that aggregates multi-scale emotion features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, each composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer explores local inter-frame emotional information across the frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for the patch correlations lost near the boundaries of segment patches. Finally, we em
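The windowing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the window size of 4, the shift of 2, and the weight-free attention are all assumptions chosen for illustration, and the attention mask that Swin-style models normally apply to the wrapped-around window is omitted.

```python
import numpy as np

def partition_time_windows(frames, window):
    """Split (T, D) frame patches into (T // window, window, D) segment windows."""
    t, d = frames.shape
    if t % window != 0:
        raise ValueError("number of frame patches must be divisible by window")
    return frames.reshape(t // window, window, d)

def shift_time(frames, shift):
    """Cyclically roll frame patches along time so that the next set of
    windows straddles the previous window boundaries."""
    return np.roll(frames, -shift, axis=0)

def window_self_attention(windows):
    """Scaled dot-product self-attention inside each window.
    Hypothetical simplification: no learned projections, single head, no mask."""
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows

# Toy input: 8 frame patches with 1 feature each, labeled by their time index.
frames = np.arange(8, dtype=float).reshape(8, 1)

# Local windows: frames 0-3 and 4-7 never interact.
local = partition_time_windows(frames, window=4)

# Shifted windows: the first window now covers frames 2-5, crossing the 3|4 boundary.
shifted = partition_time_windows(shift_time(frames, shift=2), window=4)

local_out = window_self_attention(local)
shifted_out = window_self_attention(shifted)
```

In this toy case the local windows group frames `[0, 1, 2, 3]` and `[4, 5, 6, 7]`, while the shifted windows group `[2, 3, 4, 5]` and `[6, 7, 0, 1]`, so alternating the two lets information flow across segment boundaries.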
Related papers
- SwinLip: An Efficient Visual Speech Encoder For Lip Reading Using Swin Transformer (2025)
- Multi-microphone Speech Emotion Recognition Using The Hierarchical Token-semantic Audio Transformer Architecture (2024)
- SigWavNet: Learning Multiresolution Signal Wavelet Network For Speech Emotion Recognition (2025)
- Dawn Of The Transformer Era In Speech Emotion Recognition: Closing The Valence Gap (2022)
- Speech Emotion Recognition Via CNN-Transformer And Multidimensional Attention Mechanism (2024)
- SpeechFormer: A Hierarchical Efficient Framework Incorporating The Characteristics Of Speech (2022)
- Time-frequency Transformer: A Novel Time Frequency Joint Learning Method For Speech Emotion Recognition (2023)
- Learning Local To Global Feature Aggregation For Speech Emotion Recognition (2023)