A Survey On Speech Large Language Models For Understanding
2024 · Jing Peng, Yucheng Wang, Bohan Li, et al.
Abstract
Speech understanding is essential for interpreting the diverse forms of information embedded in spoken language, including linguistic, paralinguistic, and non-linguistic cues that are vital for effective human-computer interaction. The rapid advancement of large language models (LLMs) has catalyzed the emergence of Speech Large Language Models (Speech LLMs), which marks a transformative shift toward general-purpose speech understanding systems. To further clarify and systematically delineate task objectives, in this paper, we formally define the concept of speech understanding and introduce a structured taxonomy encompassing its informational, functional, and format dimensions. Within this scope of definition, we present a comprehensive review of current Speech LLMs, analyzing their architectures through a three-stage abstraction: Modality Feature Extraction, Modality Information Fusion, and LLM Inference. In addition, we examine training strategies, discuss representative datasets, an
Related papers
- Recent Advances In Speech Language Models: A Survey (2024)
- Roadmap Towards Superhuman Speech Understanding Using Large Language Models (2024)
- Closing The Gap Between Text And Speech Understanding In LLMs (2025)
- On The Landscape Of Spoken Language Models: A Comprehensive Survey (2025)
- SpeechVerse: A Large-scale Generalizable Audio Language Model (2024)
- A Review Of Multi-modal Large Language And Vision Models (2024)
- DiscreteSLU: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026)