Internalizing ASR With Implicit Chain Of Thought For Efficient Speech-to-speech Conversational LLM
2024 · Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu
Abstract
Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.