Multi-microphone Speech Emotion Recognition Using The Hierarchical Token-semantic Audio Transformer Architecture
2024 · Ohad Cohen, Gershon Hazan, Sharon Gannot
Abstract
The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of speech emotion recognition (SER) algorithms and to develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adapt a state-of-the-art transformer model, the Hierarchical Token-semantic Audio Transformer (HTS-AT), to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
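The two fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch size, the embedding dimension, and the choice of a separate bias-free linear projection per channel are all assumptions made for illustration. The patch embedding is modeled as a linear map over non-overlapping square tiles, which is equivalent to a strided convolution whose kernel size equals its stride.

```python
import numpy as np


def average_mel(mels: np.ndarray) -> np.ndarray:
    """Strategy 1: average mel-spectrograms across channels.

    mels: (channels, n_mels, frames) -> (n_mels, frames)
    """
    return mels.mean(axis=0)


def patch_embed(spec: np.ndarray, weight: np.ndarray, patch: int) -> np.ndarray:
    """Cut one spectrogram into non-overlapping patch x patch tiles and
    project each flattened tile with a linear map (no bias, an assumption).

    spec:   (n_mels, frames)
    weight: (patch * patch, dim)
    returns (num_patches, dim)
    """
    n_mels, frames = spec.shape
    rows, cols = n_mels // patch, frames // patch
    # Drop any remainder so the spectrogram tiles evenly.
    tiles = spec[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch)
    # Bring the two tile-index axes together, then flatten each tile.
    tiles = tiles.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)
    return tiles @ weight


def sum_patch_embeddings(mels: np.ndarray, weights: np.ndarray, patch: int) -> np.ndarray:
    """Strategy 2: embed each channel separately, then sum the token sequences.

    mels:    (channels, n_mels, frames)
    weights: (channels, patch * patch, dim), hypothetical per-channel projections
    returns  (num_patches, dim)
    """
    return sum(patch_embed(m, w, patch) for m, w in zip(mels, weights))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    channels, n_mels, frames, patch, dim = 4, 64, 256, 4, 96  # illustrative sizes
    mels = rng.standard_normal((channels, n_mels, frames))
    weights = rng.standard_normal((channels, patch * patch, dim))

    print(average_mel(mels).shape)                         # single fused spectrogram
    print(sum_patch_embeddings(mels, weights, patch).shape)  # single fused token sequence
```

Either way, the downstream transformer sees an input of the same shape as in the single-channel case. Under this linear, bias-free model, summing with identical weights for every channel reduces to a scaled version of the averaging strategy, so the summation approach only adds expressiveness when the per-channel projections differ.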
Related papers
- Dawn Of The Transformer Era In Speech Emotion Recognition: Closing The Valence Gap (2022) 18.59
- Leveraging Cross-attention Transformer And Multi-feature Fusion For Cross-linguistic Speech Emotion Recognition (2025) 4.52
- Emoformer: A Text-independent Speech Emotion Recognition Using A Hybrid Transformer-CNN Model (2025) 6.34
- Cross-language Speech Emotion Recognition Using Multimodal Dual Attention Transformers (2023) 0.00
- Key-sparse Transformer For Multimodal Speech Emotion Recognition (2021) 13.50
- Speech Emotion Recognition Via CNN-transformer And Multidimensional Attention Mechanism (2024) 0.00
- Decoding Emotions: A Comprehensive Multilingual Study Of Speech Models For Speech Emotion Recognition (2023) 0.00
- Multilingual Speech Emotion Recognition With Multi-gating Mechanism And Neural Architecture Search (2022) 2.26