Improving Speech Emotion Recognition With Mutual Information Regularized Generative Model
2025 · Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, et al.
Abstract
Lack of large, well-annotated emotional speech corpora continues to limit the performance and robustness of speech emotion recognition (SER), particularly as models grow more complex and the demand for multimodal systems increases. While generative data augmentation offers a promising solution, existing approaches often produce emotionally inconsistent samples due to oversimplified conditioning on categorical labels. This paper introduces a novel mutual-information-regularised generative framework that combines cross-modal alignment with feature-level synthesis. Building on an InfoGAN-style architecture, our method first learns a semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. A feature generator is then trained to produce emotion-aware audio features while employing mutual information as a quantitative regulariser to ensure strong dependency between generated features and their conditioning variables. We extend this appro
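The mutual-information regulariser mentioned in the abstract can be illustrated with the standard InfoGAN variational lower bound, I(c; x) >= E[log Q(c | x)] + H(c). The following is a minimal sketch under stated assumptions, not the authors' implementation: the excerpt does not say which estimator the paper uses, and the names `mi_lower_bound`, `q_logits`, `codes`, and `prior` are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mi_lower_bound(q_logits, codes, prior):
    """InfoGAN-style variational lower bound on I(c; G(z, c)), in nats.

    q_logits: (batch, n_classes) outputs of a hypothetical auxiliary
              network Q applied to generated features (assumption).
    codes:    (batch,) integer emotion labels used to condition the generator.
    prior:    (n_classes,) strictly positive distribution the codes were
              sampled from.
    """
    log_q = log_softmax(q_logits)
    # Monte Carlo estimate of E[log Q(c | x)] over the batch.
    expected_log_q = log_q[np.arange(len(codes)), codes].mean()
    # Entropy H(c) of the conditioning prior.
    entropy = -(prior * np.log(prior)).sum()
    return expected_log_q + entropy
```

In a training loop, the negative of this bound would typically be added to the generator loss with a weighting coefficient, so that generated features from which Q cannot recover the conditioning emotion are penalised. When Q is uninformative the bound is zero under a uniform prior, and it approaches H(c) as Q recovers the code reliably.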
Related papers
- Augmenting Generative Adversarial Networks For Speech Emotion Recognition (2020) 10.85
- On Enhancing Speech Emotion Recognition Using Generative Adversarial Networks (2018) 12.33
- Modeling Feature Representations For Affective Speech Using Generative Adversarial Networks (2019) 0.00
- Generative Data Augmentation Guided By Triplet Loss For Speech Emotion Recognition (2022) 3.58
- Learning Representations Of Emotional Speech With Deep Convolutional Generative Adversarial Networks (2017) 0.00
- Generative Emotional AI For Speech Emotion Recognition: The Case For Synthetic Emotional Speech Augmentation (2023) 11.19
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) 4.52
- Contrastive Regularization For Multimodal Emotion Recognition Using Audio And Text (2022) 0.00