Multitask Training With Text Data For End-to-end Speech Recognition
2020 · Peidong Wang, Tara N. Sainath, Ron J. Weiss
Abstract
We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, leads to an 11% relative performance improvement over the baseline and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of different types of errors and sample output sentences demonstrate that the proposed method can incorporate language level information, suggesting its effectiveness in real-world applications.
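The multitask objective described in the abstract can be illustrated with a toy sketch. The abstract only states that the decoder is trained on both audio-text and text-only data; everything else here is an assumption made for illustration: the one-layer decoder, feeding an all-zero context vector for text-only sequences, the `text_weight` interpolation factor, and all names and sizes. It is not the authors' implementation.

```python
import math
import random

VOCAB = 5      # toy vocabulary size (hypothetical)
CTX_DIM = 3    # toy size of the attention context vector (hypothetical)

def make_decoder(seed=0):
    """Randomly initialise a toy decoder: logits = w_tok[prev] + w_ctx @ context."""
    rng = random.Random(seed)
    w_tok = [[rng.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(VOCAB)]
    w_ctx = [[rng.uniform(-1, 1) for _ in range(CTX_DIM)] for _ in range(VOCAB)]
    return w_tok, w_ctx

def step_loss(decoder, prev_token, context, target):
    """Cross-entropy of predicting `target` from the previous token and a context vector."""
    w_tok, w_ctx = decoder
    logits = [w_tok[prev_token][v] + sum(w * c for w, c in zip(w_ctx[v], context))
              for v in range(VOCAB)]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def sequence_loss(decoder, tokens, contexts):
    """Average teacher-forced loss; tokens[0] acts as the start symbol."""
    losses = [step_loss(decoder, tokens[i - 1], contexts[i - 1], tokens[i])
              for i in range(1, len(tokens))]
    return sum(losses) / len(losses)

def multitask_loss(decoder, paired, text_only, text_weight=0.5):
    """Combine the audio-text loss with a text-only loss on the same decoder.

    Assumption: text-only sequences have no audio, so the decoder receives
    an all-zero context vector at every step.
    """
    tokens, contexts = paired
    asr = sequence_loss(decoder, tokens, contexts)
    zero_ctx = [[0.0] * CTX_DIM for _ in range(len(text_only) - 1)]
    lm = sequence_loss(decoder, text_only, zero_ctx)
    return asr + text_weight * lm, asr, lm
```

Because both terms pass through the same decoder parameters, the text-only term acts as a regularizer on the decoder rather than as a separate language model, which matches the abstract's claim that no additional language model is required.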
Related papers
- Multitask Learning And Joint Optimization For Transformer-RNN-Transducer Speech Recognition (2020)
- A General Multi-task Learning Framework To Leverage Text Data For Speech To Text Tasks (2020)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- Decoder-only Architecture For Speech Recognition With CTC Prompts And Text Data Augmentation (2023)
- Improving Speech Translation By Understanding And Learning From The Auxiliary Text Translation Task (2021)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- A Spelling Correction Model For End-to-end Speech Recognition (2019)
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018)