Recent Advances In Speech Language Models: A Survey
2024 · Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, et al.
Abstract
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detaili
Related papers
- A Survey On Speech Large Language Models For Understanding (2024)
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023)
- On The Landscape Of Spoken Language Models: A Comprehensive Survey (2025)
- Roadmap Towards Superhuman Speech Understanding Using Large Language Models (2024)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
- A Review Of Multi-modal Large Language And Vision Models (2024)
- Get Large Language Models Ready To Speak: A Late-fusion Approach For Speech Generation (2024)