ASAPP-ASR: Multistream CNN And Self-attentive SRU For SOTA Speech Recognition
2020 · Jing Pan, Joshua Shapiro, Jeremy Wohlwend, et al.
Abstract
In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has a unique dilation rate for diversity. Trained with the SpecAugment data augmentation method, it achieves relative word error rate (WER) improvements of 4% on test-clean and 14% on test-other. We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.
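The multistream idea described in the abstract (parallel pipelines over the same input frames, each with its own dilation rate) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scalar features per frame, a single shared toy kernel, zero padding, and simple per-frame concatenation of the stream outputs. The function names `dilated_conv1d` and `multistream` and the dilation rates used below are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(frames: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded 1-D convolution over time; output has one value per input frame.

    Assumes an odd-length kernel so the receptive field is centered on each frame.
    """
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(frames) + [0.0] * half
    return [
        sum(w * padded[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(frames))
    ]


def multistream(frames: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]) -> List[List[float]]:
    """Run one stream per dilation rate on the same input, then concatenate per frame."""
    streams = [dilated_conv1d(frames, kernel, d) for d in dilations]
    return [list(values) for values in zip(*streams)]


if __name__ == "__main__":
    # Hypothetical toy input: five scalar "frames" and three illustrative dilation rates.
    features = multistream([1, 2, 3, 4, 5], kernel=[1, 1, 1], dilations=[1, 2, 3])
    for t, row in enumerate(features):
        print(t, row)
```

Each stream sees the same frames at a different temporal resolution: a larger dilation widens the context window without adding kernel weights, which is the source of the diversity the abstract mentions.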
Related papers
- Streaming Multi-speaker ASR With RNN-T (2020)
- RWTH ASR Systems For Librispeech: Hybrid Vs Attention -- W/o Data Augmentation (2019)
- Multistream CNN For Robust Acoustic Modeling (2020)
- ASR Performance Prediction On Unseen Broadcast Programs Using Convolutional Neural Networks (2018)
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022)
- Conformer-based Target-speaker Automatic Speech Recognition For Single-channel Audio (2023)
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- The RWTH ASR System For TED-LIUM Release 2: Improving Hybrid HMM With SpecAugment (2020)