Effective Decoder Masking For Transformer Based End-to-end Speech Recognition
2020 · Shi-Yan Weng, Berlin Chen
Abstract
The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks, such as automatic speech recognition (ASR) and text-to-speech (TTS), among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in this paper a decoder-masking-based training approach for end-to-end (E2E) ASR models. During training, we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to robustly output a correct token even when parts of its decoding history are masked or corrupted. The propose
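The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes independent per-token masking with a fixed probability, whereas the abstract only says that "some portions" of the history are replaced. The function name `mask_decoder_history`, the `mask_prob` and `protected` parameters, and the `<sos>`/`<eos>` symbols are hypothetical names chosen for illustration.

```python
import random

MASK = "[mask]"  # mask symbol named in the abstract


def mask_decoder_history(tokens, mask_prob=0.15, rng=None, protected=("<sos>",)):
    """Return a copy of `tokens` in which each non-protected token is
    replaced by MASK with probability `mask_prob`.

    Assumption: tokens are masked independently; the paper may instead
    mask contiguous spans or use a different schedule.
    """
    rng = rng or random.Random()
    return [
        tok if tok in protected or rng.random() >= mask_prob else MASK
        for tok in tokens
    ]


# Teacher-forcing setup: the decoder input is the shifted transcript.
history = ["<sos>", "the", "cat", "sat", "on", "the", "mat"]
targets = history[1:] + ["<eos>"]  # prediction targets are left unmasked

# Only the decoder's input history is corrupted, and only during training.
masked = mask_decoder_history(history, mask_prob=0.3, rng=random.Random(0))
```

In this sketch the corrupted `masked` sequence would be fed to the decoder while the loss is still computed against the clean `targets`, so the model has to predict each token from a partially hidden history. At inference time no masking would be applied.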