Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers
2021 · Takaaki Hori, Niko Moritz, Chiori Hori, et al.
Abstract
This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling stream
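The context-expansion idea in the abstract (feed several consecutive utterances to the model, but predict text only for the last one) can be illustrated with a minimal sketch of the input construction. This is not the authors' implementation: the function name `expand_context`, the `max_frames` budget, and the policy of adding whole previous utterances while they fit are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def expand_context(
    utterances: Sequence[List[Frame]],
    index: int,
    max_frames: int,
) -> Tuple[List[Frame], int]:
    """Build a context-expanded input for the utterance at `index`.

    Previous utterances are prepended, most recent first, as long as the
    whole utterance still fits in the assumed `max_frames` budget.
    Returns the concatenated frames and the offset at which the target
    (last) utterance starts, so a model could restrict its output to it.
    """
    target = utterances[index]
    context: List[Frame] = []
    budget = max_frames - len(target)

    # Walk backwards through the history, adding whole utterances that fit.
    for prev in reversed(utterances[:index]):
        if len(prev) > budget:
            break
        context = list(prev) + context
        budget -= len(prev)

    return context + list(target), len(context)


# Toy example: three utterances of 3, 2, and 4 one-dimensional frames.
utts = [[[0.0]] * 3, [[1.0]] * 2, [[2.0]] * 4]
frames, start = expand_context(utts, index=2, max_frames=7)
```

With a budget of 7 frames, only the immediately preceding 2-frame utterance fits in front of the 4-frame target, so `start` marks where the target utterance begins inside the expanded input.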
Related papers
- Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems (2023) 7.16
- Transformers With Convolutional Context For ASR (2019) 0.00
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022) 7.50
- Improving RNN-T ASR Accuracy Using Context Audio (2020) 5.84
- Deep Context: End-to-end Contextual Speech Recognition (2018) 15.57
- Fast Offline Transformer-based End-to-end Automatic Speech Recognition For Real-world Applications (2021) 7.16
- Transformer ASR With Contextual Block Processing (2019) 0.00
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) 11.08