Multi-turn RNN-T For Streaming Recognition Of Multi-party Speech
2021 · Ilya Sklyar, Anna Piunova, Xianrui Zheng, et al.
Abstract
Automatic speech recognition (ASR) of single channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximu
Related papers
- Streaming Multi-speaker ASR With RNN-T (2020)
- Separator-transducer-segmenter: Streaming Recognition And Segmentation Of Multi-party Speech (2022)
- Continuous Streaming Multi-talker ASR With Dual-path Transducers (2021)
- Streaming End-to-end Multi-talker Speech Recognition (2020)
- Streaming Multi-talker Speech Recognition With Joint Speaker Identification (2021)
- Multi-stream End-to-end Speech Recognition (2019)
- Investigation Of End-to-end Speaker-attributed ASR For Continuous Multi-talker Recordings (2020)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)