Prompting The Hidden Talent Of Web-scale Speech Models For Zero-shot Task Generalization
2023 · Puyuan Peng, Brian Yan, Shinji Watanabe, et al.
Abstract
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper by adapting it to unseen tasks with prompt engineering. We select three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts by either leveraging another large-scale model or simply manipulating the special tokens in the default prompts. Experiments show that, compared to the default prompts, our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks and even outperform SotA supervised models on some datasets. In addition, our experiments reveal many interesting properties of Whisper, including its robustness to prompts, its bias on accents, and the multilingual understanding in its latent space. Code is available at https://github.com/jasonppy/PromptingWhisper
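To make "manipulating the special tokens in the default prompts" concrete, here is a minimal string-level sketch. It assumes Whisper's usual decoder prompt layout (`<|startoftranscript|>`, a language token, a task token, `<|notimestamps|>`, with optional preceding text after `<|startofprev|>`). The helper `build_prompt` and the three variants are hypothetical illustrations; the abstract does not spell out the exact prompts used for each task, and the linked repository is the authoritative source.

```python
from typing import List, Optional, Sequence


def build_prompt(
    languages: Sequence[str],
    task: str = "transcribe",
    timestamps: bool = False,
    prev_text: Optional[str] = None,
) -> str:
    """Assemble a Whisper-style decoder prompt as a plain string.

    Hypothetical helper for illustration only; it does not call Whisper.
    """
    tokens: List[str] = []
    if prev_text:
        # Optional conditioning text placed before the transcript start.
        tokens += ["<|startofprev|>", prev_text]
    tokens.append("<|startoftranscript|>")
    # One language token per entry; the default prompt uses exactly one.
    tokens += [f"<|{lang}|>" for lang in languages]
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


# Default English ASR prompt.
default_asr = build_prompt(["en"])

# Assumed CS-ASR variant: two language tokens instead of one.
code_switched = build_prompt(["zh", "en"])

# Assumed ST variant: target-language token paired with the transcribe task.
unseen_pair_st = build_prompt(["es"], task="transcribe")

# Assumed AVSR variant: words from an external visual model as prior text.
avsr = build_prompt(["en"], prev_text=" guitar, stage")

print(default_asr)
print(code_switched)
print(unseen_pair_st)
print(avsr)
```

Keeping the prompt as a list of tokens until the final join makes each variant a one-line change to the default, which is the sense in which the abstract calls the manipulation simple.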
Related papers
- Wav2prompt: End-to-end Speech Prompt Generation And Tuning For LLM In Zero And Few-shot Learning (2024)
- Probing The Hidden Talent Of ASR Foundation Models For L2 English Oral Assessment (2025)
- Extending Whisper With Prompt Tuning To Target-speaker ASR (2023)
- A Study On Zero-shot Non-intrusive Speech Assessment Using Large Language Models (2024)
- Investigating Zero-shot Generalizability On Mandarin-english Code-switched ASR And Speech-to-text Translation Of Recent Foundation Models With Self-supervision And Weak Supervision (2023)
- WAVPROMPT: Towards Few-shot Spoken Language Understanding With Frozen Language Models (2022)
- Multilingual Distilwhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- Speechgen: Unlocking The Generative Power Of Speech Language Models With Prompts (2023)