Knowledge Distillation For Neural Transducer-based Target-speaker ASR: Exploiting Parallel Mixture/single-talker Speech Data
2023 · Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, et al.
Abstract
Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it does not incur the extra computation costs of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i.e., mixtures and enrollment speech) and reference transcriptions. The training mixtures are generally simulated by mixing single-talker signals, but conventional TS-RNNT training does not utilize single-speaker signals. This paper proposes using knowledge distillation (KD) to exploit the parallel mixture/single-talker speech data. Our proposed KD scheme uses an RNNT system pretrained with the target single-talker speech input to generate pseudo labels for the TS-RNNT training. Experimental results show that TS-RNNT systems trained with the proposed KD scheme outperform a baseline TS-RNNT.
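The KD idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the teacher RNNT (fed the single-talker target speech) and the student TS-RNNT (fed the mixture plus enrollment) each produce logits on a lattice of the same shape, indexed by frame, label position, and vocabulary entry, and that the teacher's posteriors act as soft pseudo labels. The function names, the KL-divergence form of the KD term, the additive combination with the usual transducer loss, and the `kd_weight` parameter are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one lattice node's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one lattice node."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


def kd_loss(teacher_logits, student_logits):
    """Sum the per-node KL divergence over the whole lattice.

    Both arguments are nested lists indexed [frame][label_position][vocab].
    Assumption: the lattices have identical shapes, which holds when the
    mixture is simulated from the same target utterance the teacher sees.
    """
    total = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        for teacher_node, student_node in zip(teacher_row, student_row):
            total += kl_divergence(softmax(teacher_node), softmax(student_node))
    return total


def combined_loss(hard_label_loss, distill_loss, kd_weight=0.5):
    """Hypothetical combination: reference-label loss plus weighted KD term."""
    return hard_label_loss + kd_weight * distill_loss
```

In this sketch, a student whose lattice matches the teacher's yields a KD term of zero, and any mismatch yields a positive term. In a real system `hard_label_loss` would be the transducer loss computed against the reference transcription, which is not reproduced here.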
Related papers
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023)
- Inter-KD: Intermediate Knowledge Distillation For CTC-based Automatic Speech Recognition (2022)
- Decouple Non-parametric Knowledge Distillation For End-to-end Speech Translation (2023)
- Knowledge Transfer And Distillation From Autoregressive To Non-autoregressive Speech Recognition (2022)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Source And Target Bidirectional Knowledge Distillation For End-to-end Speech Translation (2021)
- Mutual Learning Of Single- And Multi-channel End-to-end Neural Diarization (2022)
- Distilling Knowledge From Ensembles Of Acoustic Models For Joint CTC-attention End-to-end Speech Recognition (2020)