An Investigation Of Monotonic Transducers For Large-scale Automatic Speech Recognition
2022 · Niko Moritz, Frank Seide, Duc Le, et al.
Abstract
The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). The monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T) sit between these two loss types. Monotonic transducers have a few advantages. First, they avoid the runaway hallucination that RNN-T can suffer from, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, the MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way: when training is regularized via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T. This is demonstrated on LibriSpeech and on a large-scale in-house data set.
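The difference between the two decoding regimes described in the abstract can be illustrated with a minimal greedy-search sketch. This is not the authors' implementation: `score_fn` is a hypothetical stand-in for the joint network, the blank index of 0 is an assumption, and `max_symbols_per_frame` is an assumed safety cap of the kind an RNN-T style loop needs to guarantee termination.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # assumed index of the blank symbol

# Hypothetical joint-network stand-in: (frame index, label history) -> scores.
ScoreFn = Callable[[int, Sequence[int]], List[float]]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def greedy_rnnt(score_fn: ScoreFn, num_frames: int,
                max_symbols_per_frame: int = 5) -> Tuple[List[int], int]:
    """RNN-T style loop: a non-blank label does NOT advance time."""
    hyp: List[int] = []
    calls = 0          # number of model scores consumed
    t = 0
    emitted_here = 0   # labels emitted at the current frame
    while t < num_frames:
        k = argmax(score_fn(t, hyp))
        calls += 1
        if k == BLANK or emitted_here >= max_symbols_per_frame:
            t += 1             # only blank (or the cap) moves to the next frame
            emitted_here = 0
        else:
            hyp.append(k)      # emit and stay on the same frame
            emitted_here += 1
    return hyp, calls


def greedy_monotonic(score_fn: ScoreFn, num_frames: int) -> Tuple[List[int], int]:
    """Monotonic loop: exactly one model score per frame, at most one label."""
    hyp: List[int] = []
    calls = 0
    for t in range(num_frames):
        k = argmax(score_fn(t, hyp))
        calls += 1
        if k != BLANK:
            hyp.append(k)
        # time always advances, whether or not a label was emitted
    return hyp, calls


def stuck_joint(t: int, hyp: Sequence[int]) -> List[float]:
    """Toy model that always prefers label 1 over blank (a 'runaway' case)."""
    return [0.1, 0.9]
```

With the toy `stuck_joint` model, `greedy_rnnt` keeps emitting on each frame until the cap stops it, while `greedy_monotonic` emits at most one label per frame and evaluates the model exactly `num_frames` times, which is the property that makes it easier to pair with frame-synchronous FST decoders.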
Related papers
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Alignment Restricted Streaming Recurrent Neural Network Transducer (2020)
- CIF-T: A Novel CIF-based Transducer Architecture For Automatic Speech Recognition (2023)
- Exploring Neural Transducers For End-to-end Speech Recognition (2017)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- Exploring RNN-Transducer For Chinese Speech Recognition (2018)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017)