Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation With Whisper
2024 · Iuliia Thorbecke, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, et al.
Abstract
Training automatic speech recognition (ASR) models with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch on consumer-grade, accessible GPUs entirely with pseudo-labeled (PL) speech from foundational speech models (FSMs). This allows training a robust ASR model in a single stage and does not require large data or a large computational budget, unlike the two-step scenario of pre-training followed by fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models, such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) overall TT performance as a function of FSM size. Our results demonstrate that TT models can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages
Related papers
- Improving Streaming Automatic Speech Recognition With Non-streaming Model Distillation On Unsupervised Data (2020)
- Improving Streaming Transformer Based ASR Under A Framework Of Self-supervised Learning (2021)
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition (2024)
- Large-scale Streaming End-to-end Speech Translation With Neural Transducers (2022)
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023)
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020)