Separator-transducer-segmenter: Streaming Recognition And Segmentation Of Multi-party Speech
2022 · Ilya Sklyar, Anna Piunova, Christian Osendorfer
Abstract
Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work we address its challenges discovered in the previous work on multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, separator-transducer-segmenter (STS), that enables tighter integration of speech separation, recognition and segmentation in a single model. First, we propose a new segmentation modeling strategy through start-of-turn and end-of-turn tokens that improves segmentation without recognition accuracy degradation. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and multi-task training with speech activity information as an additional training signal. Third, we experiment with end-of-turn emission latency penalty to improve end-point detection for each speaker turn. Finally, we establish a novel framewo
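To illustrate the segmentation strategy the abstract describes, here is a minimal sketch of how a decoded token stream could be grouped into speaker turns by start-of-turn and end-of-turn tokens. The token spellings `<sot>` and `<eot>`, the function name `split_turns`, and the handling of unterminated turns are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Optional

SOT = "<sot>"  # hypothetical spelling of the start-of-turn token
EOT = "<eot>"  # hypothetical spelling of the end-of-turn token


def split_turns(tokens: List[str]) -> List[List[str]]:
    """Group a flat token stream into turns delimited by SOT and EOT."""
    turns: List[List[str]] = []
    current: Optional[List[str]] = None  # None means "outside any turn"
    for tok in tokens:
        if tok == SOT:
            # A new start-of-turn closes any turn that is still open.
            if current is not None:
                turns.append(current)
            current = []
        elif tok == EOT:
            if current is not None:
                turns.append(current)
            current = None
        elif current is not None:
            current.append(tok)
        # Tokens seen outside a turn are ignored in this sketch.
    # In a streaming setting the last turn may not have ended yet;
    # keep it as a partial segment.
    if current is not None:
        turns.append(current)
    return turns


stream = ["<sot>", "hello", "there", "<eot>", "<sot>", "hi", "<eot>"]
print(split_turns(stream))  # [['hello', 'there'], ['hi']]
```

In this sketch, the position of each end-of-turn token marks the end-point of a turn, which is the event the abstract's emission latency penalty is meant to make timely.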
Related papers
- Multi-turn RNN-T For Streaming Recognition Of Multi-party Speech (2021) 8.82
- Continuous Streaming Multi-talker ASR With Dual-path Transducers (2021) 7.50
- Streaming End-to-end Multi-talker Speech Recognition (2020) 11.49
- Streaming Multi-talker Speech Recognition With Joint Speaker Identification (2021) 7.50
- Streaming Multi-speaker ASR With RNN-T (2020) 10.07
- End-to-end Simultaneous Speech Translation With Differentiable Segmentation (2023) 7.16
- End-to-end Single-channel Speaker-turn Aware Conversational Speech Translation (2023) 2.26
- Speech Separation Based On Multi-stage Elaborated Dual-path Deep Bilstm With Auxiliary Identity Loss (2020) 9.77