Roadmap Towards Superhuman Speech Understanding Using Large Language Models
2024 · Fan Bu, Yuhao Zhang, Xidong Wang, et al.
Abstract
The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a
Related papers
- A Survey On Speech Large Language Models For Understanding (2024) 4.52
- Recent Advances In Speech Language Models: A Survey (2024) 14.64
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023) 8.09
- Spoken Conversational Agents With Large Language Models (2025) 0.00
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026) 9.75
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023) 0.00
- Speechverse: A Large-scale Generalizable Audio Language Model (2024) 0.00
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00