Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition With A Syllable-to-character Converter
2020 · Xiong Wang, Zhuoyuan Yao, Xian Shi, et al.
Abstract
End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, only brings a small performance gain. Because Mandarin Chinese is a character-based language in which each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts th
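The two-stage idea named in the title (acoustics to syllables, then syllables to characters) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `stage1_syllables` is a hypothetical stub standing in for the RNN-T, and the second stage uses a hand-written homophone lexicon with bigram scores and a Viterbi search, which is a stand-in technique and not the converter used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: tonal syllable -> candidate characters (homophones).
LEXICON: Dict[str, List[str]] = {
    "ni3": ["你", "拟"],
    "hao3": ["好", "郝"],
}

# Hypothetical bigram log-scores; any unseen pair gets a flat penalty.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "你"): -0.5,
    ("你", "好"): -0.2,
}
UNSEEN = -5.0


def stage1_syllables(acoustic_features) -> List[str]:
    """Stub for the first stage (assumption): ignores its input and
    returns a fixed tonal-syllable sequence instead of running an RNN-T."""
    return ["ni3", "hao3"]


def stage2_characters(syllables: List[str]) -> str:
    """Pick one character per syllable with a Viterbi search over
    the toy lexicon, scored by the toy bigram table."""
    # Map: last chosen character -> (best score so far, best path so far).
    beams: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for syl in syllables:
        new_beams: Dict[str, Tuple[float, List[str]]] = {}
        for ch in LEXICON[syl]:
            # Best predecessor for this candidate character.
            score, path = max(
                (
                    (prev_score + BIGRAM.get((prev, ch), UNSEEN), prev_path)
                    for prev, (prev_score, prev_path) in beams.items()
                ),
                key=lambda item: item[0],
            )
            new_beams[ch] = (score, path + [ch])
        beams = new_beams
    _, best_path = max(beams.values(), key=lambda item: item[0])
    return "".join(best_path)


if __name__ == "__main__":
    print(stage2_characters(stage1_syllables(None)))
```

The sketch only shows why a second stage is needed: each tonal syllable maps to several homophonous characters, so choosing among them takes context from the neighbouring characters.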
Related papers
- Research On Modeling Units Of Transformer Transducer For Mandarin Speech Recognition (2020)
- Exploring RNN-Transducer For Chinese Speech Recognition (2018)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Self-attention Transducers For End-to-end Speech Recognition (2019)
- One In A Hundred: Select The Best Predicted Sequence From Numerous Candidates For Streaming Speech Recognition (2020)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)