MIMO-Speech: End-to-end Multi-channel Multi-speaker Speech Recognition
2019 · Xuankai Chang, Wangyou Zhang, Yanmin Qian, et al.
Abstract
Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making
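The middle stage of the pipeline described in the abstract (masks in, one beamformed stream per speaker out) can be illustrated with a small NumPy sketch. This is only a sketch under stated assumptions: the abstract does not say which beamformer is used, so a mask-based MVDR formulation is assumed here, and the function names (`masked_psd`, `mvdr_weights`, `separate`), the separate noise mask, and the choice of reference microphone are all hypothetical. In the actual system the masks would come from the masking network and the outputs would feed the recognition model; here the masks are simply passed in as arrays.

```python
import numpy as np


def masked_psd(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    X:    complex STFT, shape (channels, frames, freqs)
    mask: real time-frequency mask in [0, 1], shape (frames, freqs)
    Returns an array of shape (freqs, channels, channels).
    """
    num = np.einsum("tf,ctf,dtf->fcd", mask, X, X.conj())
    den = mask.sum(axis=0)[:, None, None] + eps
    return num / den


def mvdr_weights(psd_target, psd_interf, ref=0, eps=1e-8):
    """MVDR filter per frequency bin (assumed formulation, not from the abstract).

    w = (Phi_interf^{-1} Phi_target / trace(Phi_interf^{-1} Phi_target)) u_ref
    Returns weights of shape (freqs, channels).
    """
    n_ch = psd_target.shape[-1]
    # Diagonal loading keeps the interference PSD invertible.
    psd_interf = psd_interf + eps * np.eye(n_ch)
    numerator = np.linalg.solve(psd_interf, psd_target)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[:, None, None] + eps
    return (numerator / trace)[..., ref]


def separate(X, speaker_masks, noise_mask, ref=0):
    """Return one beamformed single-channel STFT per speaker.

    Assumption: for speaker i, the interference is the noise plus all
    other speakers.
    """
    psds = [masked_psd(X, m) for m in speaker_masks]
    psd_noise = masked_psd(X, noise_mask)
    outputs = []
    for i, psd_i in enumerate(psds):
        psd_interf = psd_noise + sum(p for j, p in enumerate(psds) if j != i)
        w = mvdr_weights(psd_i, psd_interf, ref=ref)
        # Apply y(t, f) = w(f)^H x(t, f).
        outputs.append(np.einsum("fc,ctf->tf", w.conj(), X))
    return outputs
```

Every step in this sketch is differentiable, which is the property that would let an ASR loss alone drive the mask estimates in an end-to-end setup of the kind the abstract describes.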
Related papers
- A Purely End-to-end System For Multi-speaker Speech Recognition (2018)
- Survey Of End-to-end Multi-speaker Automatic Speech Recognition For Monaural Audio (2025)
- End-to-end Monaural Multi-speaker ASR System Without Pretraining (2018)
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019)
- End-to-end Dereverberation, Beamforming, And Speech Recognition With Improved Numerical Stability And Advanced Frontend (2021)
- Exploiting Single-channel Speech For Multi-channel End-to-end Speech Recognition (2021)
- MIMO-DBnet: Multi-channel Input And Multiple Outputs DOA-aware Beamforming Network For Speech Separation (2022)
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)