CUSIDE-array: A Streaming Multi-channel End-to-end Speech Recognition System With Realistic Evaluations
2024 · Xiangzhu Kong, Tianqi Ning, Hao Huang, et al.
Abstract
Multi-channel end-to-end (ME2E) ASR systems have recently emerged. While streaming single-channel end-to-end ASR has been studied extensively, streaming ME2E ASR remains little explored. In addition, recent studies call attention to the gap between in-distribution (ID) and out-of-distribution (OOD) tests and to the need for realistic evaluations. This paper focuses on two research problems: realizing streaming ME2E ASR and improving OOD generalization. We propose the CUSIDE-array method, which integrates the recent CUSIDE methodology (Chunking, Simulating Future Context and Decoding) into the neural beamformer approach of ME2E ASR. It enables streaming processing of both the front-end and the back-end with a total latency of 402 ms. The CUSIDE-array ME2E models are shown to achieve superior streaming results in both ID and OOD tests. Realistic evaluations confirm the advantage of CUSIDE-array in its capability to consume single-channel data to improve OOD generalization via back-end pre-training and M
Related papers
- CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR (2024)
- CUSIDE: Chunking, Simulating Future Context And Decoding For Streaming ASR (2022)
- Stream Attention-based Multi-array End-to-end Speech Recognition (2018)
- Multi-stream End-to-end Speech Recognition (2019)
- Cascaded Encoders For Unifying Streaming And Non-streaming ASR (2020)
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019)
- Automatic Channel Selection And Spatial Feature Integration For Multi-channel Speech Recognition Across Various Array Topologies (2023)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)