Decoupling Pronunciation And Language For End-to-end Code-switching Automatic Speech Recognition
2020 · Shuai Zhang, Jiangyan Yi, Zhengkun Tian, et al.
Abstract
Despite the recent significant advances in end-to-end (E2E) ASR systems for code-switching, the need for large amounts of audio-text paired data limits further improvement of model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic patterns from large-scale monolingual paired data. Meanwhile, during training it generates multiple phoneme sequence candidates for each audio input on the fly. The generated phoneme-text paired data is then used to train the P2T network, which can be pre-trained with large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces the high dependency on code-switching paired training data of
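The data flow between the two networks can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_a2p_nbest`, `make_p2t_pairs`, the lookup table, and the phoneme symbols are all hypothetical stand-ins, and a fixed table replaces a trained A2P network. It only shows how several phoneme candidates per utterance could be paired with the reference transcript to form P2T training examples.

```python
from typing import Callable, List, Sequence, Tuple

Phonemes = Tuple[str, ...]
A2PDecoder = Callable[[str, int], List[Phonemes]]

# Hypothetical stand-in for a trained A2P network: a fixed table that maps an
# utterance id to ranked phoneme-sequence candidates (assumed, not from the paper).
_TOY_CANDIDATES = {
    "utt1": [("n", "i", "h", "ao"), ("n", "i", "h", "au"), ("l", "i", "h", "ao")],
    "utt2": [("HH", "AH", "L", "OW"), ("HH", "EH", "L", "OW")],
}


def toy_a2p_nbest(audio_id: str, n_best: int) -> List[Phonemes]:
    """Return the top n_best phoneme candidates for one utterance."""
    return _TOY_CANDIDATES[audio_id][:n_best]


def make_p2t_pairs(
    audio_ids: Sequence[str],
    transcripts: Sequence[str],
    a2p_nbest: A2PDecoder,
    n_best: int = 2,
) -> List[Tuple[Phonemes, str]]:
    """Pair every phoneme candidate of an utterance with its reference text."""
    pairs = []
    for audio_id, text in zip(audio_ids, transcripts):
        for phonemes in a2p_nbest(audio_id, n_best):
            pairs.append((phonemes, text))
    return pairs


# Two utterances with two candidates each give four phoneme-text pairs.
pairs = make_p2t_pairs(["utt1", "utt2"], ["你好", "hello"], toy_a2p_nbest, n_best=2)
```

In a real system the table lookup would be replaced by beam-search decoding from the A2P network, and the resulting pairs would be fed to the P2T network as training data.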
Related papers
- Transformer-transducers For Code-switched Speech Recognition (2020) 10.97
- An Effective Mixture-of-experts Approach For Code-switching Speech Recognition Leveraging Encoder Disentanglement (2024) 0.00
- Constrained Output Embeddings For End-to-end Code-switching Speech Recognition With Only Monolingual Data (2019) 7.16
- End-to-end Code-switching ASR For Low-resourced Language Pairs (2019) 9.76
- Pre-training Transformer Decoder For End-to-end ASR Model With Unpaired Speech Data (2022) 13.47
- Data Augmentation For End-to-end Code-switching Speech Recognition (2020) 9.92
- Language-agnostic Code-switching In Sequence-to-sequence Speech Recognition (2022) 0.00
- Effective Decoder Masking For Transformer Based End-to-end Speech Recognition (2020) 0.00