DiscreteSLU: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding
2024 · Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, et al.
Abstract
The integration of pre-trained text-based large language models (LLMs) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen and unseen domains, as well as instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as from Mel-frequency cepstral coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
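The DSU generation step described in the abstract (frame-level encoder features followed by k-means clustering) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: random vectors stand in for self-supervised encoder outputs, the cluster count and feature size are arbitrary, and the helper names `fit_kmeans` and `quantize` are hypothetical. The speech adapter and the LLM are not modeled.

```python
import numpy as np


def quantize(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete unit)."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct frames.
    init_idx = rng.choice(len(features), size=k, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(n_iters):
        units = quantize(features, centroids)
        for j in range(k):
            members = features[units == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


# Stand-in for encoder outputs (assumption): 200 frames of 16-dim features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

centroids = fit_kmeans(frames, k=8)   # hypothetical codebook of 8 units
units = quantize(frames, centroids)   # one integer unit ID per frame
```

The resulting `units` array is a sequence of integer IDs, one per frame, which is the kind of discrete input a speech adapter would then map into the LLM token embedding space.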
Related papers
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Large Language Model Guided Decoding For Self-supervised Speech Recognition (2025)
- Paralinguistics-aware Speech-empowered Large Language Models For Natural Conversation (2024)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- A Survey On Speech Large Language Models For Understanding (2024)
- On Decoder-only Architecture For Speech-to-text And Large Language Model Integration (2023)
- A Study On The Integration Of Pre-trained SSL, ASR, LM And SLU Models For Spoken Language Understanding (2022)
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024)