Hierarchical Transformer-based Large-context End-to-end ASR With Large-context Knowledge Distillation
2021 · Ryo Masumura, Naoki Makishima, Mana Ihori, et al.
Abstract
We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and an effective training method for it based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing, in which each utterance is transcribed independently. In contrast, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, handle sequences of utterances such as discourses and conversations well. However, the transformer architecture, which has recently achieved state-of-the-art performance among utterance-level ASR systems, has not yet been introduced into large-context ASR systems. The transformer architecture can be expected to capture effectively not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this paper proposes a hierarchical transformer-based large-context E2E-ASR model that combines the transforme
Related papers
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) · 10.07
- Knowledge Transfer From Large-scale Pretrained Language Models To End-to-end Speech Recognizers (2022) · 9.41
- Distilling Knowledge From Ensembles Of Acoustic Models For Joint CTC-attention End-to-end Speech Recognition (2020) · 8.09
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023) · 4.52
- Leave No Knowledge Behind During Knowledge Distillation: Towards Practical And Effective Knowledge Distillation For Code-switching ASR Using Realistic Data (2024) · 3.58
- Knowledge Distillation From Language Model To Acoustic Model: A Hierarchical Multi-task Learning Approach (2021) · 3.58
- End-to-end Speech Translation With Knowledge Distillation (2019) · 0.00
- Sequence-level Knowledge Distillation For Class-incremental End-to-end Spoken Language Understanding (2023) · 0.00