Prompting Large Language Models With Audio For General-purpose Speech Summarization
2024 · Wonjune Kang, Deb Roy
Abstract
In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations the LLM can interpret. Using a dataset of paired speech-text data, the overall system is trained to generate consistent responses to prompts that carry the same semantic information, regardless of input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method can summarize spoken content from arbitrary domains, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach outperforms a cascade baseline of speech recognition followed by LLM text processing.
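The idea of letting an LLM treat speech "in the same way as text" can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the embedding width, the mean-pool-and-project encoder, and the names `embed_text`, `encode_audio`, and `build_prompt` are all hypothetical. The point is only that the instruction is embedded identically in both cases, while the content slot is filled either by text-token embeddings or by audio-encoder outputs of the same width.

```python
import random

EMBED_DIM = 8  # hypothetical LLM embedding width
FEAT_DIM = 5   # hypothetical acoustic feature size per frame


def embed_text(tokens, table):
    """Look up a toy embedding vector for each text token."""
    return [table[t] for t in tokens]


def encode_audio(frames, weights, stride=4):
    """Toy stand-in for a trained speech encoder (an assumption, not the
    paper's architecture): mean-pool every `stride` frames, then project
    the pooled features linearly into the LLM embedding space."""
    out = []
    for start in range(0, len(frames), stride):
        chunk = frames[start:start + stride]
        pooled = [sum(col) / len(chunk) for col in zip(*chunk)]
        out.append([sum(w * x for w, x in zip(row, pooled)) for row in weights])
    return out


def build_prompt(instruction, content_embeddings, table):
    """Prefix the content embeddings (from text or audio) with an instruction."""
    return embed_text(instruction, table) + content_embeddings


rng = random.Random(0)
vocab = ["summarize", ":", "hello", "world"]
table = {t: [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for t in vocab}
weights = [[rng.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(EMBED_DIM)]
frames = [[rng.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(10)]

instruction = ["summarize", ":"]
text_prompt = build_prompt(instruction, embed_text(["hello", "world"], table), table)
audio_prompt = build_prompt(instruction, encode_audio(frames, weights), table)
```

In a real system, both prompts would be passed to the same LLM, and training would push the response to `audio_prompt` toward the response to `text_prompt` for paired speech and transcripts. That training step is omitted here because the abstract does not specify the objective.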
Related papers
- Realizing Video Summarization From The Path Of Language-based Semantic Understanding (2024)
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023)
- AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels From Large Language Model (2024)
- Chain-of-thought Prompting For Speech Translation (2024)
- End-to-end Speech Recognition Contextualization With Large Language Models (2023)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)
- Paralinguistics-aware Speech-empowered Large Language Models For Natural Conversation (2024)
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023)