Adapting Large Language Model With Speech For Fully Formatted End-to-end Speech Recognition
2023 · Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, et al.
Abstract
Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits because of mismatches between text-based LLMs and the language models used in E2E ASR. In this paper, we explore an alternative approach: adapting pretrained LLMs to speech. Our experiments on fully formatted E2E ASR transcription tasks across various domains demonstrate that this approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is based on pretrained large language models with either an encoder-decoder or a decoder-only structure, surpasses strong ASR models such as Whisper in recognition error rate when formatting such as punctuation and capitalization is taken into account.
Related papers
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- End-to-end Speech Recognition Contextualization With Large Language Models (2023)
- Multi-stage Large Language Model Correction For Speech Recognition (2023)
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- Speech Recognition With LLMs Adapted To Disordered Speech Using Reinforcement Learning (2024)
- Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition (2024)
- A Comprehensive Solution To Connect Speech Encoder And Large Language Model For ASR (2024)
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023)