Streaming End-to-end Multi-talker Speech Recognition
2020 · Liang Lu, Naoyuki Kanda, Jinyu Li, et al.
Abstract
End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research is limited to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone, which can meet various latency constraints. We study two different model architectures, based on a speaker-differentiator encoder and a mask encoder, respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT, and the SURT model with 150 millisec
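The difference between the two training approaches named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a precomputed matrix `pair_loss[channel][reference]` (in a real system each entry would be an RNN-T loss), and it assumes that the HEAT heuristic assigns references to output channels in order of utterance start time, which the abstract itself does not state. The function names `pit_loss` and `heat_loss` are hypothetical.

```python
from itertools import permutations


def pit_loss(pair_loss):
    """PIT: search all channel-to-reference assignments, keep the cheapest.

    pair_loss[ch][ref] is the loss of output channel `ch` scored against
    reference transcript `ref` (assumed precomputed for this sketch).
    Returns (total_loss, assignment), where assignment[ch] is the reference
    index given to channel ch.
    """
    num_channels = len(pair_loss)
    best = None
    for perm in permutations(range(num_channels)):
        total = sum(pair_loss[ch][ref] for ch, ref in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


def heat_loss(pair_loss, start_times):
    """HEAT-style fixed assignment (assumed heuristic: order by start time).

    Channel k is paired with the reference that has the k-th earliest start
    time, so only one assignment is evaluated and no search is needed.
    """
    order = sorted(range(len(start_times)), key=lambda ref: start_times[ref])
    total = sum(pair_loss[ch][ref] for ch, ref in enumerate(order))
    return total, tuple(order)


# Toy two-speaker example with made-up loss values.
pair_loss = [
    [1.0, 4.0],  # channel 0 vs reference 0, reference 1
    [3.0, 2.0],  # channel 1 vs reference 0, reference 1
]
pit_result = pit_loss(pair_loss)
heat_result = heat_loss(pair_loss, start_times=[0.0, 1.5])
```

With N output channels, the PIT search evaluates N! assignments per training example, while the fixed assignment evaluates exactly one. The two can disagree on a given example; over training, the fixed rule is what lets each channel learn a consistent role.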
Related papers
- Streaming Multi-talker Speech Recognition With Joint Speaker Identification (2021)
- Continuous Streaming Multi-talker ASR With Dual-path Transducers (2021)
- Multi-turn RNN-T For Streaming Recognition Of Multi-party Speech (2021)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Separator-Transducer-Segmenter: Streaming Recognition And Segmentation Of Multi-party Speech (2022)
- Single-channel Multi-talker Speech Recognition With Permutation Invariant Training (2017)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Multitask Learning And Joint Optimization For Transformer-RNN-Transducer Speech Recognition (2020)