Multi-stream End-to-end Speech Recognition
2019 · Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, et al.
Abstract
Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR, with parallel streams represented by separate encoders that aim to capture diverse information. On top of the regular attention networks, a Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework. In the MEM-Res framework, two heterogeneous encoders with different architectures,
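The stream-level fusion and multi-task objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fuse_streams`, `joint_loss`), the fixed stream scores, the averaging of per-stream CTC losses, and the interpolation weight `lam` are all assumptions made for illustration. In the actual model, the stream scores would come from a learned attention mechanism conditioned on the decoder state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(contexts, stream_scores):
    """Combine per-stream context vectors using stream-level attention weights.

    contexts: one context vector (list of floats) per encoder stream,
        each assumed to come from that stream's own frame-level attention.
    stream_scores: one unnormalized relevance score per stream
        (hypothetical inputs; learned in a real model).
    """
    if len(contexts) != len(stream_scores):
        raise ValueError("need exactly one score per stream")
    weights = softmax(stream_scores)
    dim = len(contexts[0])
    fused = [
        sum(w * ctx[d] for w, ctx in zip(weights, contexts))
        for d in range(dim)
    ]
    return fused, weights


def joint_loss(ctc_losses, att_loss, lam=0.5):
    """Interpolate CTC and attention losses for multi-task training.

    Assumption: the per-stream CTC losses are averaged before interpolation.
    """
    ctc = sum(ctc_losses) / len(ctc_losses)
    return lam * ctc + (1.0 - lam) * att_loss
```

With equal scores, every stream contributes equally to the fused context. A much larger score for one stream pushes its weight toward 1, which is the sense in which the decoder is steered toward the more informative encoder.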
Related papers
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018) 0.00
- Stream Attention-based Multi-array End-to-end Speech Recognition (2018) 0.00
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016) 20.43
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- 4D ASR: Joint Modeling Of CTC, Attention, Transducer, And Mask-predict Decoders (2022) 7.50
- Online Hybrid CTC/attention End-to-end Automatic Speech Recognition Architecture (2023) 12.99
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021) 0.00
- Joint Beam Search Integrating CTC, Attention, And Transducer Decoders (2024) 5.24