An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR
2021 · Huaibo Zhao, Yosuke Higuchi, Tetsuji Ogawa, et al.
Abstract
In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered attention mechanism, which performs autoregressive decoding triggered by CTC spikes, has been shown to be effective in streaming ASR. However, its performance depends on accurate alignment estimation from the CTC outputs, and maintaining that accuracy requires decoding with some future input (i.e., with higher latency). In streaming ASR, it is desirable to achieve high recognition accuracy while keeping latency low. Therefore, the present study aims to achieve highly accurate, low-latency streaming ASR by introducing Mask-CTC, which is capable of learning feature representations that anticipate future information (i.e., that can consider long-term contexts).
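The triggering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes greedy (argmax) frame-level CTC labels, a blank index of 0, and a fixed look-ahead counted in encoder frames. The names `ctc_triggers` and `attention_windows` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank label


def ctc_triggers(frame_labels: List[int], blank: int = BLANK) -> List[Tuple[int, int]]:
    """Return (label, frame_index) for the first frame of each CTC spike.

    frame_labels holds the greedy CTC label for every encoder frame.
    A trigger fires when a non-blank label differs from the label of the
    previous frame, which mirrors the usual CTC collapsing rule.
    """
    triggers = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            triggers.append((label, t))
        prev = label
    return triggers


def attention_windows(
    triggers: List[Tuple[int, int]], num_frames: int, lookahead: int
) -> List[int]:
    """Return, per trigger, the exclusive end frame the decoder may attend to.

    Each output token sees encoder frames [0, end), where end covers the
    trigger frame plus `lookahead` future frames, clipped to the utterance.
    """
    return [min(t + 1 + lookahead, num_frames) for _, t in triggers]


# Hypothetical greedy CTC output for a 10-frame utterance.
frame_labels = [0, 0, 3, 3, 0, 5, 0, 0, 3, 0]
triggers = ctc_triggers(frame_labels)
low_latency = attention_windows(triggers, len(frame_labels), lookahead=0)
high_latency = attention_windows(triggers, len(frame_labels), lookahead=2)
```

A larger `lookahead` gives the decoder more future frames per token at the cost of added delay, which is the accuracy versus latency trade-off the abstract refers to.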
Related papers
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021)
- Online Hybrid CTC/Attention End-to-end Automatic Speech Recognition Architecture (2023)
- Bridging The Gap Between Streaming And Non-streaming ASR Systems By Distilling Ensembles Of CTC And RNN-T Models (2021)
- Focused Discriminative Training For Streaming CTC-trained Automatic Speech Recognition Models (2024)
- CUSIDE-T: Chunking, Simulating Future And Decoding For Transducer Based Streaming ASR (2024)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)