An Improved Hybrid CTC-Attention Model for Speech Recognition
2018 · Zhe Yuan, Zhuoran Lyu, Jiwei Li, et al.
Abstract
Recently, end-to-end speech recognition with a hybrid model consisting of connectionist temporal classification (CTC) and an attention encoder-decoder has achieved state-of-the-art results. In this paper, we propose a novel CTC decoder structure based on our experiments and explore the relation between decoding performance and the depth of the encoder. We also apply an attention smoothing mechanism to acquire more context information for subword-based decoding. Taken together, these strategies allow us to achieve a word error rate (WER) of 4.43% without a language model (LM) and 3.34% with an RNN-LM on the test-clean subset of the LibriSpeech corpus, which are by far the best reported WERs for end-to-end ASR systems on this dataset.
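The abstract names two ingredients without giving formulas: a hybrid CTC-attention objective and attention smoothing. The sketch below is illustrative only and is not taken from the paper. It assumes the interpolated multi-task loss commonly used in joint CTC-attention training, and it assumes one common smoothing formulation in which the exponential in the softmax is replaced by a sigmoid. All function names and the `ctc_weight` value are hypothetical.

```python
import math


def softmax_weights(scores):
    """Standard attention weights: exponentiate, then normalize."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_weights(scores):
    """Assumed smoothing variant: sigmoid instead of exp, then normalize.

    The bounded sigmoid keeps one large score from dominating, so the
    weights spread over more encoder frames (more context per step).
    """
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [v / total for v in sig]


def hybrid_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Assumed interpolation of the two losses; 0.3 is a placeholder."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical attention scores for four encoder frames.
scores = [4.0, 1.0, 0.5, -1.0]
sharp = softmax_weights(scores)
smooth = smoothed_weights(scores)
```

With these example scores, the softmax puts most of its mass on the first frame, while the sigmoid-normalized weights are flatter. This flatter distribution is the effect the abstract describes as acquiring more context for subword-based decoding.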