Harnessing The Zero-shot Power Of Instruction-tuned Large Language Model In End-to-end Speech Recognition
2023 · Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi
Abstract
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern large language models (LLMs) are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific instruction. The decoder subsequently takes as input the LLM output to perform token predictions, combini
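The first two steps described in the abstract (obtaining a hypothesis from the encoder via CTC decoding, then pairing it with an instruction for the LLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the frame scores, the blank index, the instruction wording, and the helper names `ctc_greedy_decode` and `build_prompt` are all assumptions made for the example. The sketch stops before the LLM call and the decoder, which the abstract does not specify in enough detail to reproduce.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol

# Hypothetical toy vocabulary (word-level, for readability only).
VOCAB = {1: "i", 2: "has", 3: "a", 4: "cat"}


def ctc_greedy_decode(frame_scores, blank_id=BLANK_ID):
    """Greedy CTC decoding: best token per frame, merge repeats, drop blanks."""
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    collapsed = [token for token, _ in groupby(best_path)]
    return [token for token in collapsed if token != blank_id]


def build_prompt(
    hypothesis,
    instruction="Correct grammatical errors in the following sentence.",
):
    """Combine an instruction (hypothetical wording) with the ASR hypothesis."""
    return f"{instruction}\n\n{hypothesis}"


# Made-up per-frame scores over [blank, i, has, a, cat].
frame_scores = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # i
    [0.10, 0.80, 0.05, 0.03, 0.02],  # i (repeat, merged)
    [0.90, 0.04, 0.03, 0.02, 0.01],  # blank
    [0.10, 0.05, 0.70, 0.10, 0.05],  # has
    [0.10, 0.05, 0.05, 0.70, 0.10],  # a
    [0.90, 0.04, 0.03, 0.02, 0.01],  # blank
    [0.10, 0.05, 0.05, 0.10, 0.70],  # cat
    [0.10, 0.05, 0.05, 0.10, 0.70],  # cat (repeat, merged)
]

token_ids = ctc_greedy_decode(frame_scores)
hypothesis = " ".join(VOCAB[t] for t in token_ids)
prompt = build_prompt(hypothesis)
print(prompt)
```

In the architecture the abstract describes, a prompt like this would be fed to the instruction-tuned LLM, and the LLM's output representations, rather than its generated text alone, would serve as input to the decoder.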
Related papers
- Multi-stage Large Language Model Correction For Speech Recognition (2023)
- Corpus Synthesis For Zero-shot ASR Domain Adaptation Using Large Language Models (2023)
- Prompting Large Language Models For Zero-shot Domain Adaptation In Speech Recognition (2023)
- Zero-resource Speech Translation And Recognition With LLMs (2024)
- Azeros: Extending LLM To Speech With Self-generated Instruction-free Tuning (2025)
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Chain-of-thought Prompting For Speech Translation (2024)
- Adapting Large Language Model With Speech For Fully Formatted End-to-end Speech Recognition (2023)