Research On Modeling Units Of Transformer Transducer For Mandarin Speech Recognition
2020 · Li Fu, Xiaoxiao Li, Libo Zi
Abstract
The modeling unit and the model architecture are two key factors for the Recurrent Neural Network Transducer (RNN-T) in end-to-end speech recognition. To improve the performance of RNN-T on the Mandarin speech recognition task, a novel transformer transducer that combines a self-attention transformer with an RNN is proposed. The choice of modeling units for the transformer transducer is then explored. In addition, we present a new mix-bandwidth training method to obtain a general model that can accurately recognize Mandarin speech at different sampling rates simultaneously. All of our experiments are conducted on about 12,000 hours of Mandarin speech sampled at 8 kHz and 16 kHz. Experimental results show that the Mandarin transformer transducer using syllables with tone achieves the best performance. It yields average relative Word Error Rate (WER) reductions of 14.4% and 44.1% when compared with the models using syllable initial/final with tone and Chinese ch
Related papers
- A Comparison Of Modeling Units In Sequence-to-sequence Speech Recognition With The Transformer On Mandarin Chinese (2018)
- Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition With A Syllable-to-character Converter (2020)
- Exploring RNN-Transducer For Chinese Speech Recognition (2018)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Multitask Learning And Joint Optimization For Transformer-RNN-Transducer Speech Recognition (2020)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- Self-attention Transducers For End-to-end Speech Recognition (2019)
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020)