Joint Learning Using Mixture-of-expert-based Representation For Speech Enhancement And Robust Emotion Recognition

Abstract

arXiv:2509.08470v2

Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task
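The routing idea described in the abstract (task-specific gates choosing, frame by frame, from a shared pool of experts) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the top-k softmax gate, the single-layer tanh experts, the task names `se` and `ser`, and all sizes (`T`, `D`, `H`, `E`, `k`) are assumptions chosen only for illustration, and random arrays stand in for self-supervised speech features.

```python
import numpy as np

def top_k_gate(frames, gate_weights, k):
    """Per-frame sparse mixture weights, shape (T, E). Hypothetical gate form."""
    logits = frames @ gate_weights                           # (T, E)
    # Indices of the k largest logits for each frame.
    top_idx = np.argpartition(logits, -k, axis=1)[:, -k:]    # (T, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax restricted to the selected experts (stabilised by max subtraction).
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    probs = np.exp(top_logits)
    probs /= probs.sum(axis=1, keepdims=True)
    weights = np.zeros_like(logits)
    np.put_along_axis(weights, top_idx, probs, axis=1)
    return weights

def sparse_moe(frames, expert_weights, gate_weights, k=2):
    """Mix shared experts per frame using one task's gate."""
    weights = top_k_gate(frames, gate_weights, k)            # (T, E)
    # Every expert is evaluated densely here for clarity; a real sparse
    # layer would only run the experts selected for each frame.
    expert_out = np.tanh(np.einsum("td,edh->eth", frames, expert_weights))
    mixed = np.einsum("te,eth->th", weights, expert_out)     # (T, H)
    return mixed, weights

rng = np.random.default_rng(0)
T, D, H, E, K = 50, 16, 8, 4, 2            # assumed sizes, not from the paper
frames = rng.standard_normal((T, D))       # stand-in for SSL frame features
experts = rng.standard_normal((E, D, H)) * 0.1   # shared expert pool
gates = {                                  # one gating matrix per task
    "se": rng.standard_normal((D, E)),
    "ser": rng.standard_normal((D, E)),
}
outputs = {task: sparse_moe(frames, experts, g, k=K) for task, g in gates.items()}
```

Both tasks read from the same `experts` array, so the pool's parameters are shared, while each task's gate decides which experts contribute to each frame. How closely this matches the gating and expert design in Sparse MERIT cannot be determined from the abstract alone.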
