Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
2024 · Keqi Deng, Jinxi Guo, Yingyi Ma, et al.
Abstract
While large language models (LLMs) have been applied to automatic speech recognition (ASR), making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields only limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss
Related papers
- LAMASSU: Streaming Language-agnostic Multilingual Speech Recognition And Translation Using Neural Transducers (2022) 7.50
- Adapting Large Language Model With Speech For Fully Formatted End-to-end Speech Recognition (2023) 0.00
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- Zero-resource Speech Translation And Recognition With LLMs (2024) 3.58
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020) 8.82
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024) 11.23
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023) 8.09
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024) 0.00