Neural Speech Recognizer: Acoustic-to-word LSTM Model For Large Vocabulary Speech Recognition
2016 · Hagen Soltau, Hank Liao, Hasim Sak
Abstract
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
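To make "removing the need to decode" concrete, here is a minimal sketch of how a word-level CTC output can be turned into a transcript with no language model and no search. It assumes standard CTC best-path (greedy) decoding: pick the highest-scoring label per frame, collapse consecutive repeats, and drop blanks. The abstract does not spell out this procedure, so the function name `greedy_ctc_words`, the blank index 0, and the toy three-entry vocabulary are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def greedy_ctc_words(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank: int = BLANK,
) -> List[str]:
    """Best-path CTC decoding over word labels (hypothetical helper).

    frame_scores[t][k] is the model's score for label k at frame t.
    """
    words: List[str] = []
    prev = blank
    for scores in frame_scores:
        # Highest-scoring label for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a word only when the label changes and is not blank;
        # this collapses repeats and removes blanks in one pass.
        if best != blank and best != prev:
            words.append(vocab[best])
        prev = best
    return words


# Toy example: 3 labels (blank plus two words), 5 frames.
toy_vocab = ["<blank>", "play", "music"]
toy_frames = [
    [0.10, 0.80, 0.10],  # play
    [0.20, 0.70, 0.10],  # play (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # music
    [0.10, 0.10, 0.80],  # music (repeat, collapsed)
]
transcript = " ".join(greedy_ctc_words(toy_frames, toy_vocab))
```

In a system like the one described, the label set would hold roughly 100,000 words plus the blank, and the per-frame scores would come from the bi-directional LSTM. The sketch covers only the final label-to-text step.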
Related papers
- Phoneme Based Neural Transducer For Large Vocabulary Speech Recognition (2020) 9.59
- Deep LSTM For Large Vocabulary Continuous Speech Recognition (2017) 14.58
- Long Short-term Memory Based Convolutional Recurrent Neural Networks For Large Vocabulary Speech Recognition (2016) 6.77
- Advances In All-neural Speech Recognition (2016) 11.29
- Personalized Speech Recognition On Mobile Devices (2016) 15.37
- End-to-end Speech Recognition Using A High Rank LSTM-CTC Based Model (2019) 11.54
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017) 0.00
- CTC-segmentation Of Large Corpora For German End-to-end Speech Recognition (2020) 12.93