ContextNet: Improving Convolutional Neural Networks For Automatic Speech Recognition With Global Context
2020 · Wei Han, Zhengdong Zhang, Yu Zhang, et al.
Abstract
Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the pro
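The squeeze-and-excitation idea mentioned in the abstract can be sketched for a 1D feature sequence as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `squeeze_excite_1d`, the reduction ratio `r`, the ReLU bottleneck activation, and the random weights are all hypothetical choices made for the example.

```python
import numpy as np

def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Apply a squeeze-and-excitation gate to x of shape (time, channels)."""
    # Squeeze: average over the time axis to get one global context vector.
    context = x.mean(axis=0)                               # (channels,)
    # Excitation: bottleneck layer (ReLU is an assumption), then sigmoid gate.
    hidden = np.maximum(context @ w1 + b1, 0.0)            # (channels // r,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))       # (channels,), in (0, 1)
    # Scale every time step by the same per-channel gate (broadcast over time).
    return x * gate

# Hypothetical sizes and randomly initialised weights, for illustration only.
rng = np.random.default_rng(0)
T, C, r = 50, 16, 8
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C // r, C)) * 0.1
b2 = np.zeros(C)

y = squeeze_excite_1d(x, w1, b1, w2, b2)
print(y.shape)
```

Because the gate is computed from an average over the whole sequence, each output frame depends on every input frame, which is how a local convolution output can be made to carry global context.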
Related papers
- Transformers With Convolutional Context For ASR (2019) · 0.00
- Constrained Convolutional-recurrent Networks To Improve Speech Quality With Low Impact On Recognition Accuracy (2018) · 5.24
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) · 10.07
- A Dual-staged Context Aggregation Method Towards Efficient End-to-end Speech Enhancement (2019) · 0.00
- Speaker Representation Learning Using Global Context Guided Channel And Time-frequency Transformations (2020) · 6.34
- PCNN: A Lightweight Parallel Conformer Neural Network For Efficient Monaural Speech Enhancement (2023) · 6.77
- Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems (2023) · 7.16
- Self-consistent Context Aware Conformer Transducer For Speech Recognition (2024) · 0.00