Handling Trade-offs In Speech Separation With Sparsely-gated Mixture Of Experts
2022 · Xiaofei Wang, Zhuo Chen, Yu Shi, et al.
Abstract
Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also requires a higher computational cost. Second, an SS model that is more optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech regions. In this paper, we address these trade-offs with a sparsely-gated mixture-of-experts (MoE) architecture. Comprehensive evaluation results obtained using both simulated and real meeting recordings show that our proposed sparsely-gated MoE SS model achieves superior separation capabilities with less speech distortion, while involving only a marginal run-time cost increase.
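As a rough illustration of why sparse gating keeps the run-time cost close to that of a single expert, the sketch below routes each input frame to its top-k experts and evaluates only those. It is a toy under stated assumptions, not the architecture evaluated in the paper: the class name `SparseMoE`, the linear experts, the softmax gate, the dimensions, and the random weights are all hypothetical choices made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseMoE:
    """Toy sparsely-gated MoE layer (hypothetical, for illustration only)."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Gate: one score per expert, computed from the input frame.
        self.gate = rand_matrix(num_experts, dim)
        # Experts: assumed here to be plain linear maps of size dim x dim.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, frame):
        probs = softmax(matvec(self.gate, frame))
        # Keep only the top_k experts; the others are never evaluated.
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = chosen[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        output = [0.0] * len(frame)
        for i in chosen:
            expert_out = matvec(self.experts[i], frame)
            weight = probs[i] / norm  # renormalise over the selected experts
            output = [o + weight * e for o, e in zip(output, expert_out)]
        return output, chosen


layer = SparseMoE(dim=4, num_experts=8, top_k=1)
frame = [0.5, -1.0, 0.25, 2.0]
output, chosen = layer.forward(frame)
```

With `top_k=1`, the per-frame compute is one gate evaluation plus one expert, regardless of how many experts the layer holds. Total parameter count grows with `num_experts`, while the work done per frame stays roughly constant.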
Related papers
- Transcription-free Fine-tuning Of Speech Separation Models For Noisy And Reverberant Multi-speaker Automatic Speech Recognition (2024) 3.58
- Investigation Of Practical Aspects Of Single Channel Speech Separation For ASR (2021) 7.81
- Building A Great Multi-lingual Teacher With Sparsely-gated Mixture Of Experts For Speech Recognition (2021) 0.00
- Unified Modeling Of Multi-talker Overlapped Speech Recognition And Diarization With A Sidecar Separator (2023) 7.50
- An Initialization Scheme For Meeting Separation With Spatial Mixture Models (2022) 7.16
- Elevating Robust Multi-talker ASR By Decoupling Speaker Separation And Speech Recognition (2025) 0.00
- UME: Upcycling Mixture-of-experts For Scalable And Efficient Automatic Speech Recognition (2024) 2.26
- Unifying Speech Enhancement And Separation With Gradient Modulation For End-to-end Noise-robust Speech Separation (2023) 0.00