Exploring The Limits Of Decoder-only Models Trained On Public Speech Recognition Corpora
2024 · Ankit Gupta, George Saon, Brian Kingsbury
Abstract
The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of audio-only proprietary data respectively, has created a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate which factors, such as the choice of training datasets and modeling components, are necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
Related papers
- Less Is More: Accurate Speech Recognition & Translation Without Web-scale Data (2024)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems With Discrete Speech Representations (2024)
- OWSM-CTC: An Open Encoder-only Speech Foundation Model For Speech Recognition, Translation, And Language Identification (2024)
- On The Transferability Of Whisper-based Representations For "in-the-wild" Cross-task Downstream Speech Applications (2023)
- BigSSL: Exploring The Frontier Of Large-scale Semi-supervised Learning For Automatic Speech Recognition (2021)
- Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation With Whisper (2024)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Streaming Decoder-only Automatic Speech Recognition With Discrete Speech Units: A Pilot Study (2024)