WAVPROMPT: Towards Few-shot Spoken Language Understanding With Frozen Language Models
2022 · Heting Gao, Junrui Ni, Kaizhi Qian, et al.
Abstract
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In additio
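The idea in the abstract, where a trainable audio encoder emits embeddings that a frozen language model consumes alongside its own text embeddings, can be sketched in a toy form. This is a minimal illustration, not the authors' implementation: the linear frame projection stands in for the wav2vec encoder, the lookup table stands in for the frozen language model's embedding layer, and the prompt template, vocabulary, dimensions, and all function names are assumptions made for the example.

```python
import random

EMBED_DIM = 8   # hypothetical width of the language model's embedding space
FRAME_SIZE = 4  # hypothetical number of waveform samples per audio frame


def embed_text(tokens, table):
    """Look up (frozen) text embeddings for a list of tokens."""
    return [table[t] for t in tokens]


def encode_audio(waveform, weights):
    """Toy stand-in for the trainable audio encoder.

    Splits the waveform into non-overlapping frames and linearly projects
    each frame to EMBED_DIM, giving one embedding per frame.
    """
    frames = [waveform[i:i + FRAME_SIZE]
              for i in range(0, len(waveform) - FRAME_SIZE + 1, FRAME_SIZE)]
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def build_prompt(shots, query_wave, question, table, weights):
    """Concatenate few-shot demonstrations and the query into one sequence.

    Each demonstration contributes audio embeddings followed by the text
    embeddings of the question and its answer; the query ends with the
    question only, leaving the answer for the language model to generate.
    """
    seq = []
    for wave, answer in shots:
        seq += encode_audio(wave, weights)
        seq += embed_text(question + [answer], table)
    seq += encode_audio(query_wave, weights)
    seq += embed_text(question, table)
    return seq


rng = random.Random(0)
vocab = ["the", "speaker", "is", "talking", "about", "music", "weather"]
# Frozen text embedding table (never updated in this setup).
table = {t: [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for t in vocab}
# Encoder weights: the only parameters that would be trained.
weights = [[rng.gauss(0, 1) for _ in range(FRAME_SIZE)] for _ in range(EMBED_DIM)]

question = ["the", "speaker", "is", "talking", "about"]
shots = [
    ([rng.uniform(-1, 1) for _ in range(12)], "music"),    # 3 audio frames
    ([rng.uniform(-1, 1) for _ in range(8)], "weather"),   # 2 audio frames
]
query_wave = [rng.uniform(-1, 1) for _ in range(16)]       # 4 audio frames

prompt = build_prompt(shots, query_wave, question, table, weights)
print(len(prompt), len(prompt[0]))
```

In a real system the resulting sequence would be passed to the language model as input embeddings in place of token IDs, and gradients would flow only into the audio encoder.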
Related papers
- Wav2prompt: End-to-end Speech Prompt Generation And Tuning For LLM In Zero And Few-shot Learning (2024)
- Prompting The Hidden Talent Of Web-scale Speech Models For Zero-shot Task Generalization (2023)
- Frozen Large Language Models Can Perceive Paralinguistic Aspects Of Speech (2024)
- Voiceprompter: Robust Zero-shot Voice Conversion With Voice Prompt And Conditional Flow Matching (2025)
- Wavthruvec: Latent Speech Representation As Intermediate Features For Neural Speech Synthesis (2022)
- Sentence Embedder Guided Utterance Encoder (SEGUE) For Spoken Language Understanding (2023)
- Avformer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023)
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)