Streaming End-to-end Bilingual ASR Systems With Joint Language Identification
2020 · Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, et al.
Abstract
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on the fly with minimal latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish […]
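The abstract names two mechanisms: LID embeddings fed in alongside the acoustic input, and language targets modeled jointly with ASR targets on the output side. The sketch below illustrates only the data plumbing around an RNN-T, under assumptions the abstract does not confirm: the LID classifier yields one embedding per acoustic frame, and the language is represented by a tag token prepended to the ASR target sequence. The names `LANG_TAGS`, `augment_features`, `joint_targets`, and `split_hypothesis` are hypothetical, and no model, loss, or training loop is shown.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical language-tag tokens added to the RNN-T output vocabulary.
# Token spellings and language codes are illustrative assumptions.
LANG_TAGS = {"en": "<en>", "es": "<es>"}
TAG_TO_LANG = {tag: lang for lang, tag in LANG_TAGS.items()}


def augment_features(
    frames: Sequence[Sequence[float]],
    lid_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Input side: concatenate each acoustic frame with the LID embedding
    produced for that frame by a (hypothetical) streaming LID classifier."""
    if len(frames) != len(lid_embeddings):
        raise ValueError("need exactly one LID embedding per acoustic frame")
    return [list(f) + list(e) for f, e in zip(frames, lid_embeddings)]


def joint_targets(tokens: Sequence[str], language: str) -> List[str]:
    """Output side: prepend the utterance's language tag to its ASR targets,
    so a single RNN-T learns to emit both (tag placement is an assumption)."""
    return [LANG_TAGS[language]] + list(tokens)


def split_hypothesis(hyp: Sequence[str]) -> Tuple[Optional[str], List[str]]:
    """Recover the predicted language (first tag emitted, if any) and the
    ASR tokens from a decoded hypothesis."""
    language: Optional[str] = None
    words: List[str] = []
    for tok in hyp:
        if tok in TAG_TO_LANG:
            if language is None:
                language = TAG_TO_LANG[tok]
        else:
            words.append(tok)
    return language, words


if __name__ == "__main__":
    feats = augment_features([[0.1, 0.2], [0.3, 0.4]], [[0.9, 0.1], [0.8, 0.2]])
    print(feats)  # [[0.1, 0.2, 0.9, 0.1], [0.3, 0.4, 0.8, 0.2]]
    print(joint_targets(["hola", "mundo"], "es"))  # ['<es>', 'hola', 'mundo']
    print(split_hypothesis(["<es>", "hola", "mundo"]))  # ('es', ['hola', 'mundo'])
```

Prepending the tag forces the model to commit to a language before emitting any words, which suits low-latency downstream use but gives it little acoustic evidence. A streaming system might reasonably place or emit the tag differently; this sketch does not reflect the paper's actual choice.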
Related papers
- Streaming Language Identification Using Combination Of Acoustic Representations And ASR Hypotheses (2020)
- Large-scale Multilingual Speech Recognition With A Streaming End-to-end Model (2019)
- Rnn-transducer With Language Bias For End-to-end Mandarin-english Code-switching Speech Recognition (2020)
- Multilingual Speech Recognition With A Single End-to-end Model (2017)
- Joint Language Identification Of Code-switching Speech Using Attention Based E2E Network (2019)
- Advanced Accent/dialect Identification And Accentedness Assessment With Multi-embedding Models And Automatic Speech Recognition (2023)
- Streaming Multi-speaker ASR With RNN-T (2020)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)