Wav2Prompt: End-to-end Speech Prompt Generation And Tuning For LLM In Zero And Few-shot Learning
2024 · Keqi Deng, Guangzhi Sun, Philip C. Woodland
Abstract
Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in prior work and preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specif
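The continuous integrate-and-fire (CIF) alignment mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cif`, the threshold of 1.0, and the assumption that every per-frame weight is smaller than the threshold (so at most one output fires per frame) are illustrative choices. The idea is that per-frame weights are accumulated over the encoder frames, and each time the running sum reaches the threshold one text-rate vector is emitted, which could then be compared against an LLM token embedding during training.

```python
import numpy as np

def cif(frames, alphas, threshold=1.0):
    """Integrate frame vectors by weight and fire one output per threshold crossing.

    frames: (T, D) array of encoder outputs (hypothetical input).
    alphas: (T,) array of non-negative weights, each assumed < threshold.
    Returns an (N, D) array with one integrated vector per firing.
    """
    fired = []
    acc_w = 0.0                            # weight accumulated since the last firing
    acc_v = np.zeros(frames.shape[1])      # weighted sum of frames since the last firing
    for h, a in zip(frames, alphas):
        if acc_w + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_w += a
            acc_v = acc_v + a * h
        else:
            # Use only the part of this frame's weight needed to reach the threshold.
            used = threshold - acc_w
            fired.append(acc_v + used * h)
            # The remainder of the weight starts the next integrated vector.
            acc_w = a - used
            acc_v = acc_w * h
    if not fired:
        return np.zeros((0, frames.shape[1]))
    return np.stack(fired)

# Four frames whose weights sum to 2.0 yield two text-rate vectors.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
alphas = np.array([0.5, 0.5, 0.25, 0.75])
outputs = cif(frames, alphas)
```

Because the weights sum to roughly the number of target tokens, the output sequence has text length rather than frame length, which is what makes a token-by-token comparison with LLM embeddings possible.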
Related papers
- WavPrompt: Towards Few-shot Spoken Language Understanding With Frozen Language Models (2022)
- Prompting Large Language Models For Zero-shot Domain Adaptation In Speech Recognition (2023)
- Prompting The Hidden Talent Of Web-scale Speech Models For Zero-shot Task Generalization (2023)
- Expressive Prompting: Improving Emotion Intensity And Speaker Consistency In Zero-shot TTS (2024)
- SpeechGen: Unlocking The Generative Power Of Speech Language Models With Prompts (2023)
- Chain-of-thought Prompting For Speech Translation (2024)
- Effective Text Adaptation For LLM-based ASR Through Soft Prompt Fine-tuning (2024)
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)