Building A Great Multi-lingual Teacher With Sparsely-gated Mixture Of Experts For Speech Recognition
2021 · Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, et al.
Abstract
The sparsely-gated Mixture of Experts (MoE) can increase a network's capacity at little additional computational cost. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on data from multiple languages that the MoE networks can reduce word error rates by a relative 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID, and a label decoder with the MoE.
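To make the idea of sparse gating concrete, here is a minimal sketch of one MoE feed-forward layer. It is not the authors' implementation: the abstract does not specify the routing rule, so top-1 routing is assumed here, and the class name `SparseMoELayer`, the layer sizes, and the random weights are all hypothetical. Only the chosen expert runs for each frame, so the cost per frame stays close to that of a single feed-forward block even as experts are added.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseMoELayer:
    """Hypothetical top-1 routed MoE feed-forward layer (illustration only)."""

    def __init__(self, dim, hidden, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Router: one score per expert for each input frame.
        self.router = rand_matrix(num_experts, dim)
        # Each expert is a two-layer feed-forward block (dim -> hidden -> dim).
        self.experts = [
            (rand_matrix(hidden, dim), rand_matrix(dim, hidden))
            for _ in range(num_experts)
        ]

    def forward(self, frame):
        gate = softmax(matvec(self.router, frame))
        # Assumption: top-1 routing, so only one expert is evaluated per frame.
        chosen = max(range(len(gate)), key=gate.__getitem__)
        w_in, w_out = self.experts[chosen]
        hidden = [max(0.0, h) for h in matvec(w_in, frame)]  # ReLU
        # Scale the expert output by its gate probability.
        output = [gate[chosen] * y for y in matvec(w_out, hidden)]
        return output, chosen


if __name__ == "__main__":
    layer = SparseMoELayer(dim=4, hidden=8, num_experts=3)
    frames = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.1, -0.7, 0.0]]
    for frame in frames:
        output, expert_id = layer.forward(frame)
        print(expert_id, [round(v, 4) for v in output])
```

Adding experts in this sketch grows the parameter count linearly, while the work per frame grows only by one extra router score per expert.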
Related papers
- SpeechMoE: Scaling To Large Acoustic Models With Dynamic Routing Mixture Of Experts (2021)
- Language-routing Mixture Of Experts For Multilingual And Code-switching Speech Recognition (2023)
- UME: Upcycling Mixture-of-experts For Scalable And Efficient Automatic Speech Recognition (2024)
- SC-MoE: Switch Conformer Mixture Of Experts For Unified Streaming And Non-streaming Code-switching ASR (2024)
- MoLE: Mixture Of Language Experts For Multi-lingual Automatic Speech Recognition (2023)
- Handling Trade-offs In Speech Separation With Sparsely-gated Mixture Of Experts (2022)
- HDMoLE: Mixture Of LoRA Experts With Hierarchical Routing And Dynamic Thresholds For Fine-tuning LLM-based ASR Models (2024)
- BA-MoE: Boundary-aware Mixture-of-experts Adapter For Code-switching Speech Recognition (2023)