Multi-speaker ASR Combining Non-autoregressive Conformer CTC And Conditional Speaker Chain
2021 · Pengcheng Guo, Xuankai Chang, Shinji Watanabe, et al.
Abstract
Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, little research has explored NAR approaches for sequence-to-multi-sequence problems such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR mo
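The chain described in the abstract (one parallel CTC pass per speaker, each conditioned on the previous step) can be sketched roughly as below. This is a toy illustration under stated assumptions, not the authors' implementation: `step_fn` and `toy_step` are hypothetical stand-ins for the Conformer CTC encoder and the conditional speaker features, and the blank index `0` is assumed.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

BLANK = 0  # assumed CTC blank index (not specified in the source)


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Standard CTC greedy post-processing: merge repeats, then drop blanks."""
    out: List[int] = []
    prev: Optional[int] = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


def conditional_chain_decode(
    mixture: Any,
    num_speakers: int,
    step_fn: Callable[[Any, Any], Tuple[Sequence[int], Any]],
) -> List[List[int]]:
    """Infer speakers one by one; the loop runs exactly num_speakers times.

    step_fn(mixture, condition) is a hypothetical interface: it returns the
    per-frame argmax labels for the next speaker (computed in parallel over
    frames) and the updated conditioning state passed to the following step.
    """
    condition = None  # no previous speaker before the first step
    hypotheses: List[List[int]] = []
    for _ in range(num_speakers):
        frame_ids, condition = step_fn(mixture, condition)
        hypotheses.append(ctc_greedy_collapse(frame_ids))
    return hypotheses


def toy_step(mixture, condition):
    """Hypothetical stand-in for the encoder: each frame already holds one
    label per speaker, and the condition is just the last speaker index."""
    idx = 0 if condition is None else condition + 1
    return [frame[idx] for frame in mixture], idx


# Five frames, two speakers; each tuple is (speaker 0 label, speaker 1 label).
toy_mixture = [(1, 0), (1, 3), (0, 3), (2, 0), (2, 4)]
result = conditional_chain_decode(toy_mixture, num_speakers=2, step_fn=toy_step)
print(result)  # [[1, 2], [3, 4]]
```

The point of the sketch is the cost structure: the outer loop runs once per speaker, while all frames within a step are labelled at once, so the number of sequential steps does not depend on the output length.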
Related papers
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- Conformer-based Target-speaker Automatic Speech Recognition For Single-channel Audio (2023)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Hierarchical Conditional End-to-end ASR With CTC And Multi-granular Subword Units (2021)
- 3M: Multi-loss, Multi-path And Multi-level Neural Networks For Speech Recognition (2022)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)
- End-to-end Monaural Multi-speaker ASR System Without Pretraining (2018)
- Speaker Conditioning Of Acoustic Models Using Affine Transformation For Multi-speaker Speech Recognition (2021)