Conformer-based Target-speaker Automatic Speech Recognition For Single-channel Audio
2023 · Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, et al.
Abstract
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module, a Conformer-based masking module, and a Conformer-based ASR module. These modules are jointly optimized to transcribe a target speaker while ignoring speech from other speakers. For training, we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction loss that encourages the model to better separate the target speaker's spectrogram from the mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report TS-WER for the first time on the WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%), and LibriSpeech3Mix (7.6%) datasets, establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through the NVIDIA NeMo toolkit.
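The abstract does not give the formula for the scale-invariant spectrogram reconstruction loss. As a rough illustration of what "scale-invariant" means here, the sketch below applies the familiar SI-SDR-style projection to a pair of magnitude spectrograms. The function name, the flattening of the spectrogram, and the `eps` constant are assumptions made for this example; the loss actually used in the paper may differ.

```python
import numpy as np


def scale_invariant_recon_loss(estimate, target, eps=1e-8):
    """Negative SI-SDR-style score (in dB) between two spectrograms.

    Assumption: both inputs are real-valued arrays of the same shape
    (e.g. frequency bins x frames). This is an illustrative stand-in,
    not the paper's confirmed formulation.
    """
    est = np.asarray(estimate, dtype=np.float64).ravel()
    tgt = np.asarray(target, dtype=np.float64).ravel()

    # Scale the target so that it best matches the estimate
    # (least-squares projection of the estimate onto the target).
    alpha = np.dot(est, tgt) / (np.dot(tgt, tgt) + eps)
    projection = alpha * tgt

    # Whatever the projection cannot explain is treated as error.
    residual = est - projection

    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    # Lower is better, so the dB ratio is negated to act as a loss.
    return -10.0 * np.log10(ratio)


# Usage: a close estimate should score lower than a noisy one, and
# rescaling the estimate should leave the loss essentially unchanged.
rng = np.random.default_rng(0)
target_spec = np.abs(rng.normal(size=(80, 100)))
noise = rng.normal(size=(80, 100))
close_loss = scale_invariant_recon_loss(target_spec + 0.01 * noise, target_spec)
noisy_loss = scale_invariant_recon_loss(target_spec + 1.0 * noise, target_spec)
```

Because the target is rescaled before the error is measured, an estimate that is merely louder or quieter than the reference is not penalized; only the part of the estimate that does not align with the target counts against it. In a joint setup like the one described, a term of this kind would be added to the CTC loss with some weighting, which the abstract does not specify.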
Related papers
- META-CAT: Speaker-informed Speech Embeddings Via Meta Information Concatenation For Multi-talker ASR (2024)
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)
- Multi-speaker ASR Combining Non-autoregressive Conformer CTC And Conditional Speaker Chain (2021)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Speaker-conditioning Single-channel Target Speaker Extraction Using Conformer-based Architectures (2022)
- X-tasnet: Robust And Accurate Time-domain Speaker Extraction Network (2020)
- ASAPP-ASR: Multistream CNN And Self-attentive SRU For SOTA Speech Recognition (2020)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)