From Senones To Chenones: Tied Context-dependent Graphemes For Hybrid Speech Recognition
2019 · Duc Le, Xiaohui Zhang, Weiyi Zheng, et al.
Abstract
There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English which has poor grapheme-phoneme correspondence. In this work, we show for the first time that, on English, hybrid ASR systems can in fact model graphemes effectively by leveraging tied context-dependent graphemes, i.e., chenones. Our chenone-based systems significantly outperform equivalent senone baselines by 4.5% to 11.1% relative on three different English datasets. Our results on Librispeech are state-of-the-art compared to other hybrid approaches and competitive with previously published end-to-end numbers. Further analysis shows that chenones can better utilize powerful acoustic models and large training data, and require context- and position-dependent modeling to work well. Chenone-based systems also outperform senone baselines on proper noun and rare
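To make "context- and position-dependent" graphemes concrete, here is a minimal sketch of how a word might be expanded into such units before any tying takes place. The `left-center+right` notation, the word-position tags (`B`, `I`, `E`, `S`), the `#` boundary placeholder, and the function name are all illustrative assumptions borrowed from common hybrid ASR conventions, not details taken from the paper. The decision-tree clustering that would tie these units into chenones is not shown.

```python
def context_dependent_graphemes(word):
    """Expand a word into position-tagged graphemes with left/right context.

    Hypothetical helper for illustration only; the notation is an assumption.
    """
    letters = list(word.lower())
    n = len(letters)

    # Tag each grapheme with its position in the word:
    # B = begin, I = internal, E = end, S = singleton (one-letter word).
    tagged = []
    for i, ch in enumerate(letters):
        if n == 1:
            pos = "S"
        elif i == 0:
            pos = "B"
        elif i == n - 1:
            pos = "E"
        else:
            pos = "I"
        tagged.append(f"{ch}_{pos}")

    # Attach the immediate left and right neighbours as context.
    # "#" is a placeholder for the word boundary (an assumption; a real
    # system would typically use cross-word context instead).
    units = []
    for i, unit in enumerate(tagged):
        left = tagged[i - 1] if i > 0 else "#"
        right = tagged[i + 1] if i < n - 1 else "#"
        units.append(f"{left}-{unit}+{right}")
    return units


print(context_dependent_graphemes("cat"))
```

Under these assumptions, the same letter yields different units depending on its neighbours and its position in the word. The number of distinct units therefore grows quickly, which is why such units are normally tied into a smaller shared set.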
Related papers
- G2G: TTS-driven Pronunciation Learning For Graphemic Hybrid ASR (2019)
- Phonetic And Graphemic Systems For Multi-genre Broadcast Transcription (2018)
- On The Choice Of Modeling Unit For Sequence-to-sequence Speech Recognition (2019)
- Analyzing Phonetic And Graphemic Representations In End-to-end Automatic Speech Recognition (2019)
- A Systematic Comparison Of Grapheme-based Vs. Phoneme-based Label Units For Encoder-decoder-attention Models (2020)
- Deep Context: End-to-end Contextual Speech Recognition (2018)
- RWTH ASR Systems For Librispeech: Hybrid Vs Attention -- w/o Data Augmentation (2019)
- Towards A Competitive End-to-end Speech Recognition For CHiME-6 Dinner Party Transcription (2020)