Joint Learning Using Mixture-of-expert-based Representation For Speech Enhancement And Robust Emotion Recognition
2026 · Jing-Tong Tzeng, Carlos Busso, Chi-Chun Lee
Abstract
arXiv:2509.08470v2
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task
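The routing idea in the abstract (one gating network per task, each choosing per frame from a shared pool of experts) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the top-k selection rule, the linear experts, the random weights, the class name `SparseMoESketch`, and the task keys `"se"` and `"ser"` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def top_k_gate(logits, k):
    """Per frame, keep the k largest logits, softmax over them, zero the rest."""
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]       # (T, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)    # (T, k)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    gates = np.zeros_like(logits)
    np.put_along_axis(gates, top_idx, weights, axis=-1)
    return gates                                                 # (T, E), k nonzeros per row


class SparseMoESketch:
    """Hypothetical layer: shared experts, one gating matrix per task."""

    def __init__(self, dim, num_experts, tasks=("se", "ser"), k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Shared pool of experts; each expert is a single linear map (assumption).
        self.experts = rng.standard_normal((num_experts, dim, dim)) / np.sqrt(dim)
        # Task-specific gating networks; each is a single linear map (assumption).
        self.gates = {
            t: rng.standard_normal((dim, num_experts)) / np.sqrt(dim) for t in tasks
        }

    def route(self, frames, task):
        """Return sparse per-frame gate weights of shape (T, E) for one task."""
        return top_k_gate(frames @ self.gates[task], self.k)

    def forward(self, frames, task):
        gates = self.route(frames, task)                             # (T, E)
        # For clarity every expert is evaluated and then masked by the gates;
        # an efficient version would evaluate only the selected experts.
        expert_out = np.einsum("td,edh->teh", frames, self.experts)  # (T, E, D)
        return np.einsum("te,teh->th", gates, expert_out)            # (T, D)


# Usage: 5 frames of 8-dimensional features standing in for SSL representations.
frames = np.random.default_rng(1).standard_normal((5, 8))
moe = SparseMoESketch(dim=8, num_experts=4, k=2)
se_repr = moe.forward(frames, "se")    # representation routed for enhancement
ser_repr = moe.forward(frames, "ser")  # representation routed for emotion recognition
```

Because each task has its own gate, the same frame can be sent to different experts for SE and for SER while the expert parameters stay shared.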
Related papers
- On The Efficacy And Noise-robustness Of Jointly Learned Speech Emotion And Automatic Speech Recognition (2023) · 3.58
- Towards Speech Emotion Recognition "in The Wild" Using Aggregated Corpora And Deep Multi-task Learning (2017) · 12.87
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022) · 10.07
- ML-SAN: Multi-level Speaker-adaptive Network For Emotion Recognition In Conversations (2026) · 0.00
- Metadata-enhanced Speech Emotion Recognition: Augmented Residual Integration And Co-attention In Two-stage Fine-tuning (2024) · 5.24
- MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition (2025) · 0.00
- SpeechEQ: Speech Emotion Recognition Based On Multi-scale Unified Datasets And Multitask Learning (2022) · 5.84
- Learning Discriminative Features Using Center Loss And Reconstruction As Regularizer For Speech Emotion Recognition (2019) · 0.00