META-CAT: Speaker-informed Speech Embeddings Via Meta Information Concatenation For Multi-talker ASR
2024 · Jinhan Wang, Weiqing Wang, Kunal Dhawan, et al.
Abstract
We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervision from a pre-trained speaker diarization module. We introduce an intuitive yet effective method for masking ASR encoder activations using output from the speaker supervision module, a technique we term Meta-Cat (meta-information concatenation), that can be applied to both MS-ASR and TS-ASR. Our results demonstrate that the proposed architecture achieves competitive performance in both MS-ASR and TS-ASR tasks, without the need for traditional methods, such as neural mask estimation or masking at the audio or feature level. Furthermore, we demonstrate a glimpse of a unified dual-task model which can efficiently handle both MS-ASR and TS-ASR tasks. Thus, this work illustrates that a robust end-to-end multi-talker ASR framework can be implement
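The abstract describes Meta-Cat only at a high level, so the following is a minimal sketch of one plausible reading: each speaker's frame-level activity from the diarization module scales the ASR encoder activations, and the per-speaker copies are concatenated along the feature axis. The function name `meta_cat`, the array shapes, and the choice of elementwise multiplication are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def meta_cat(encoder_out: np.ndarray, speaker_probs: np.ndarray) -> np.ndarray:
    """Hypothetical mask-and-concatenate step (illustrative only).

    encoder_out:   (T, D) ASR encoder activations for T frames.
    speaker_probs: (T, S) per-frame speaker activity in [0, 1],
                   assumed to come from a diarization module.
    Returns:       (T, S * D) speaker-informed embeddings.
    """
    if encoder_out.ndim != 2 or speaker_probs.ndim != 2:
        raise ValueError("expected 2-D inputs of shape (T, D) and (T, S)")
    if encoder_out.shape[0] != speaker_probs.shape[0]:
        raise ValueError("encoder_out and speaker_probs must share T")

    # Broadcast (T, S, 1) * (T, 1, D) -> (T, S, D): one masked copy per speaker.
    masked = speaker_probs[:, :, None] * encoder_out[:, None, :]

    # Concatenate the S masked copies along the feature axis.
    num_frames, num_speakers, dim = masked.shape
    return masked.reshape(num_frames, num_speakers * dim)


if __name__ == "__main__":
    enc = np.ones((3, 2))                                   # T=3 frames, D=2
    probs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # S=2 speakers
    print(meta_cat(enc, probs).shape)
```

Under this assumption, a target-speaker variant could keep only the block of columns belonging to the speaker of interest, while a multi-speaker variant would pass the full concatenation to the decoder; the abstract does not specify either detail.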