End-to-end Speech Recognition Contextualization With Large Language Models
2023 · Egor Lakomkin, Chunyang Wu, Yassir Fathullah, et al.
Abstract
In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improves by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that has been trained on
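The mixed-modal, decoder-only setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ordering (context first, then audio, then transcription), the helper names `embed_tokens` and `build_example`, the toy embedding, and the loss mask restricted to transcription positions are all assumptions made for illustration. Audio features are assumed to be already projected to the LLM embedding dimension.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MixedModalExample:
    inputs: List[List[float]]  # one embedding vector per sequence position
    loss_mask: List[int]       # 1 where the transcription is predicted, else 0


def embed_tokens(token_ids: List[int], dim: int) -> List[List[float]]:
    # Hypothetical stand-in for an LLM embedding-table lookup.
    return [[float(t)] * dim for t in token_ids]


def build_example(
    audio_features: List[List[float]],
    transcript_ids: List[int],
    context_ids: Optional[List[int]] = None,
    dim: int = 4,
) -> MixedModalExample:
    # Optional text context; an empty list when no context is supplied.
    context = embed_tokens(context_ids or [], dim)
    target = embed_tokens(transcript_ids, dim)
    # Assumed ordering: context tokens, audio features, transcription tokens.
    inputs = context + list(audio_features) + target
    # Assumed training target: only transcription positions contribute to the loss.
    prefix_len = len(context) + len(audio_features)
    loss_mask = [0] * prefix_len + [1] * len(target)
    return MixedModalExample(inputs, loss_mask)


audio = [[0.1] * 4 for _ in range(3)]  # three toy audio frames
with_context = build_example(audio, [5, 6], context_ids=[1])
without_context = build_example(audio, [5, 6])
```

Because the context is simply a prefix that may be empty, the same model can be trained and run with or without it, which matches the abstract's description of the context tokens as optional.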
Related papers
- Enhancing Large Language Model-based Speech Recognition By Contextualization For Rare And Ambiguous Words (2024)
- Adapting Large Language Model With Speech For Fully Formatted End-to-end Speech Recognition (2023)
- Attention-based Contextual Language Model Adaptation For Speech Recognition (2021)
- Multi-stage Large Language Model Correction For Speech Recognition (2023)
- End-to-end Contextual Speech Recognition Using Class Language Models And A Token Passing Decoder (2018)
- Enhancing Speaker Diarization With Large Language Models: A Contextual Beam Search Approach (2023)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023)