Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition
2018 · Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, et al.
Abstract
Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model. Two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary acoustic information. A hierarchical attention mechanism is then used to combine the encoder-level information. To demonstrate the effectiveness of the proposed model, experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reduction of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6% WER on the WSJ eval92 test set, which is the best WER reported for an end
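The combination step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each encoder has already produced one context vector for the current decoding step, and that a second, stream-level attention weights those vectors before they reach the decoder. The additive scoring form, the function names (`fuse_streams`, `joint_loss`), the parameter names (`W_q`, `W_c`, `v`, `lam`) and the averaging of the two CTC losses are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_streams(contexts, query, W_q, W_c, v):
    """Stream-level attention over per-encoder context vectors.

    contexts: list of arrays of shape (d,), one per encoder (assumed given)
    query:    decoder state of shape (d_q,)
    W_q, W_c, v: hypothetical parameters of an additive scoring function
    """
    scores = np.array([v @ np.tanh(W_q @ query + W_c @ c) for c in contexts])
    weights = softmax(scores)                      # one weight per encoder
    fused = sum(w * c for w, c in zip(weights, contexts))
    return fused, weights

def joint_loss(ctc_losses, att_loss, lam):
    # Multi-task objective: interpolate the CTC and attention losses.
    # Averaging the per-encoder CTC losses is an assumption of this sketch.
    return lam * float(np.mean(ctc_losses)) + (1.0 - lam) * att_loss

rng = np.random.default_rng(0)
d, d_q, h = 4, 3, 5
contexts = [rng.normal(size=d), rng.normal(size=d)]   # two encoders
query = rng.normal(size=d_q)
W_q, W_c, v = rng.normal(size=(h, d_q)), rng.normal(size=(h, d)), rng.normal(size=h)

fused, weights = fuse_streams(contexts, query, W_q, W_c, v)
total = joint_loss([2.0, 4.0], 1.0, lam=0.5)
```

Here `weights` always sums to one, so `fused` is a convex combination of the two encoder contexts, and `lam` trades off the CTC and attention objectives.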
Related papers
- Multi-stream End-to-end Speech Recognition (2019)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- Stream Attention-based Multi-array End-to-end Speech Recognition (2018)
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021)
- An Improved Hybrid CTC-attention Model For Speech Recognition (2018)
- 3M: Multi-loss, Multi-path And Multi-level Neural Networks For Speech Recognition (2022)
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022)
- Advances In Joint CTC-attention Based End-to-end Speech Recognition With A Deep CNN Encoder And RNN-LM (2017)