Streaming End-to-end Speech Recognition With Jointly Trained Neural Feature Enhancement
2021 · Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, et al.
Abstract
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers gradually increases. With GREL, the portion of the Mean Squared Error (MSE) loss for the enhanced output grad
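The two schedules named in the abstract can be pictured with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear ramp, the step boundaries, the use of element-wise interpolation for GAEF, and the function names `linear_ramp`, `gaef_mix`, and `grel_loss`. The abstract only states that the share of enhanced features grows over training (GAEF) and, as the name suggests, that the share of the enhancement MSE loss shrinks (GREL).

```python
def linear_ramp(step, start_step, end_step):
    """Fraction in [0, 1] rising linearly between start_step and end_step.

    Assumption: a linear schedule; the abstract does not specify its shape.
    """
    if end_step <= start_step:
        raise ValueError("end_step must be greater than start_step")
    frac = (step - start_step) / (end_step - start_step)
    return min(1.0, max(0.0, frac))


def gaef_mix(clean, enhanced, step, start_step=1000, end_step=5000):
    """GAEF-style input: start from clean features, then shift toward
    the enhancement-layer outputs as training progresses.

    Assumption: the "portion" is modeled as an interpolation weight.
    """
    p = linear_ramp(step, start_step, end_step)
    return [(1.0 - p) * c + p * e for c, e in zip(clean, enhanced)]


def grel_loss(asr_loss, mse_loss, step, start_step=1000, end_step=5000,
              initial_weight=1.0):
    """GREL-style objective: the weight on the enhancement MSE loss
    decays from initial_weight to zero (hypothetical values)."""
    w = initial_weight * (1.0 - linear_ramp(step, start_step, end_step))
    return asr_loss + w * mse_loss
```

With these hypothetical defaults, `gaef_mix` returns the clean features unchanged before step 1000 and the enhanced features alone from step 5000 onward, while `grel_loss` reduces to the recognition loss alone once the ramp completes.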
Related papers
- Alignment Knowledge Distillation For Online Streaming Attention-based Speech Recognition (2021) 7.16
- A Comparison Of Streaming Models And Data Augmentation Methods For Robust Speech Recognition (2021) 2.26
- Streaming Chunk-aware Multihead Attention For Online End-to-end Speech Recognition (2020) 8.60
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Chunked Attention-based Encoder-decoder Model For Streaming Speech Recognition (2023) 7.81
- An Online Attention-based Model For Speech Recognition (2018) 9.59
- Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition (2020) 5.84
- StableEmit: Selection Probability Discount For Reducing Emission Latency Of Streaming Monotonic Attention ASR (2021) 3.58