Deep Context: End-to-end Contextual Speech Recognition
2018 · Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, et al.
Abstract
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
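The shallow-fusion baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common log-linear form, in which the contextual model's log-probability is added to the E2E model's log-probability with a tunable weight. The function names, the weight `lam`, and the example scores are hypothetical and are not taken from the paper. For brevity the sketch scores whole hypotheses, whereas the abstract describes fusion applied during beam search.

```python
import math


def shallow_fusion_score(las_log_prob, context_log_prob, lam=0.3):
    """Log-linear interpolation of an E2E score with a contextual LM score.

    `lam` is a hypothetical interpolation weight, not a value from the paper.
    """
    return las_log_prob + lam * context_log_prob


def rerank(hypotheses, lam=0.3):
    """Sort (text, las_log_prob, context_log_prob) tuples, best fused score first."""
    return sorted(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lam),
        reverse=True,
    )


# Made-up scores: the E2E model slightly prefers "call jon",
# while the contextual model (e.g. a contact list) strongly prefers "call joan".
hyps = [
    ("call jon", math.log(0.5), math.log(0.01)),
    ("call joan", math.log(0.4), math.log(0.9)),
]

best_with_context = rerank(hyps, lam=0.3)[0][0]
best_without_context = rerank(hyps, lam=0.0)[0][0]
```

With `lam=0.0` the contextual model is ignored and the E2E model's own ranking is returned. In this scheme the two models are trained independently and combined only at decoding time, which is the design the abstract contrasts with the jointly optimized CLAS system.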
Related papers
- CIF-based Collaborative Decoding For End-to-end Contextual Speech Recognition (2020) · 9.76
- End-to-end Contextual Speech Recognition Using Class Language Models And A Token Passing Decoder (2018) · 11.08
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) · 10.07
- Improving Neural Biasing For Contextual Speech Recognition By Early Context Injection And Text Perturbation (2024) · 8.09
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022) · 0.00
- Fast Contextual Adaptation With Neural Associative Memory For On-device Personalized Speech Recognition (2021) · 9.76
- Efficient Long-form Speech Recognition For General Speech In-context Learning (2024) · 0.00
- Deep Contextualized Acoustic Representations For Semi-supervised Speech Recognition (2019) · 14.62