VarArray Meets t-SOT: Advancing The State Of The Art Of Streaming Distant Conversational Speech Recognition
2022 · Naoyuki Kanda, Jian Wu, Xiaofei Wang, et al.
Abstract
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two recent, independently developed technologies: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from two separated speech signals produced by VarArray. We also propose a pre-training scheme for such an ASR model, in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments using the AMI meeting corpus show that the system based on the proposed framework significantly outperforms conventional ones. Our system achieves the state-of-the-art word error r
Related papers
- t-SOT FNT: Streaming Multi-talker ASR With Text-only Domain Adaptation Capability (2023) 3.58
- Stream Attention-based Multi-array End-to-end Speech Recognition (2018) 0.00
- AV Taris: Online Audio-visual Speech Recognition (2020) 0.00
- Continuous Streaming Multi-talker ASR With Dual-path Transducers (2021) 7.50
- Multi-turn RNN-T For Streaming Recognition Of Multi-party Speech (2021) 8.82
- A Comparative Study On Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022) 8.09
- VARA-TTS: Non-autoregressive Text-to-speech Synthesis Based On Very Deep VAE With Residual Attention (2021) 0.00
- A Sidecar Separator Can Convert A Single-talker Speech Recognition System To A Multi-talker One (2023) 9.03