Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
2023 · Huaibo Zhao, Yosuke Higuchi, Yusuke Kida, et al.
Abstract
Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but incurs higher latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn a feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also dis