Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems
2023 · Mingyu Cui, Jiawen Kang, Jiajun Deng, et al.
Abstract
Current ASR systems are mainly trained and evaluated at the utterance level, although long-range cross-utterance context can also be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. Previous research has relied either on LSTM-RNN encoded histories, which attenuate the information from longer-range contexts, or on frame-level concatenation of Transformer context embeddings. In contrast, this paper learns compact, low-dimensional cross-utterance contextual features in the Conformer-Transducer encoder, using specially designed attention pooling layers applied over efficiently cached history vectors of the preceding utterances. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance-internal context only, with statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data.
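The abstract does not give the exact design of the attention pooling layers, so the following is only a minimal, dependency-free sketch of generic single-head attention pooling, not the authors' implementation. It assumes the cached history is a list of fixed-size vectors and that a small set of query vectors (hypothetical here, learned in a real model) selects what to keep. The function names `softmax` and `attention_pool` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(history, queries):
    """Pool T cached history vectors into len(queries) context vectors.

    history: list of T vectors (each a list of d floats), standing in for
             cached encoder outputs of preceding utterances (assumed layout).
    queries: list of k query vectors of dimension d (hypothetical; these
             would be learned parameters in a trained model).
    """
    dim = len(history[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    pooled = []
    for q in queries:
        # One attention score per cached history vector.
        scores = [scale * sum(qi * hi for qi, hi in zip(q, h)) for h in history]
        weights = softmax(scores)
        # Weighted sum of the history vectors, one output dimension at a time.
        pooled.append([
            sum(w * h[j] for w, h in zip(weights, history))
            for j in range(dim)
        ])
    return pooled


# Toy example: 6 cached vectors of dimension 4 pooled into 2 context vectors.
history = [[0.1 * (t + j) for j in range(4)] for t in range(6)]
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
context = attention_pool(history, queries)
print(len(context), len(context[0]))  # prints "2 4"
```

The point of the sketch is the shape change: however many history vectors are cached, the output has a fixed size set by the number of queries, which is what makes the representation compact compared with frame-level concatenation.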
Related papers
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) 10.07
- Transformers With Convolutional Context For ASR (2019) 0.00
- Dctx-conformer: Dynamic Context Carry-over For Low Latency Unified Streaming And Non-streaming Conformer ASR (2023) 2.26
- Two Stage Contextual Word Filtering For Context Bias In Unified Streaming And Non-streaming Transducer (2023) 7.16
- Self-consistent Context Aware Conformer Transducer For Speech Recognition (2024) 0.00
- Improving RNN-T ASR Accuracy Using Context Audio (2020) 5.84
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) 11.08
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022) 0.00