On Decoder-only Architecture For Speech-to-text And Large Language Model Integration
2023 · Jian Wu, Yashesh Gaur, Zhuo Chen, et al.
Abstract
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baseli
Related papers
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- A Comprehensive Solution To Connect Speech Encoder And Large Language Model For ASR (2024)
- Ideal-LLM: Integrating Dual Encoders And Language-adapted LLM For Multilingual Speech-to-text (2024)
- Zero-resource Speech Translation And Recognition With LLMs (2024)
- Recent Advances In Speech Language Models: A Survey (2024)
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025)
- End-to-end Speech Recognition Contextualization With Large Language Models (2023)