Investigating The Emergent Audio Classification Ability Of ASR Foundation Models
2023 · Rao Ma, Adian Liusie, Mark J. F. Gales, et al.
Abstract
Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by
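A minimal sketch of how template-based prompts and decoding probabilities could be turned into a zero-shot prediction, as the abstract describes. The template string, the function names, and the per-token log-probabilities are illustrative assumptions, not taken from the paper. In practice the log-probabilities would come from scoring each prompt with the ASR decoder conditioned on the input audio.

```python
import math
from typing import Dict, List

# Hypothetical prompt template; the exact wording is an assumption.
TEMPLATE = "This is a sound of {}."


def build_prompts(class_names: List[str], template: str = TEMPLATE) -> Dict[str, str]:
    """Map each class name to the text prompt that would be scored by the decoder."""
    return {name: template.format(name) for name in class_names}


def class_posteriors(token_log_probs: Dict[str, List[float]]) -> Dict[str, float]:
    """Normalise prompt likelihoods over the candidate classes.

    token_log_probs[c] holds the decoder's log-probability of each token in the
    prompt for class c (placeholder numbers here, not real model outputs).
    """
    # The sequence log-probability is the sum of the token log-probabilities.
    scores = {c: sum(lps) for c, lps in token_log_probs.items()}
    # Softmax over classes, shifted by the maximum for numerical stability.
    top = max(scores.values())
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


def predict(token_log_probs: Dict[str, List[float]]) -> str:
    """Return the class whose prompt the decoder finds most likely."""
    posteriors = class_posteriors(token_log_probs)
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    prompts = build_prompts(["dog bark", "rain"])
    print(prompts)
    # Placeholder token log-probabilities for the two prompts.
    fake_scores = {"dog bark": [-0.2, -0.1, -0.3], "rain": [-1.5, -0.9, -2.0]}
    print(predict(fake_scores))
```

No parameters are trained or added in this sketch: the only inputs are the candidate class names and the likelihoods the decoder assigns to each filled-in prompt.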
Related papers
- Probing The Hidden Talent Of ASR Foundation Models For L2 English Oral Assessment (2025) · 0.00
- Investigating Zero-shot Generalizability On Mandarin-English Code-switched ASR And Speech-to-text Translation Of Recent Foundation Models With Self-supervision And Weak Supervision (2023) · 0.00
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025) · 3.29
- Prompting The Hidden Talent Of Web-scale Speech Models For Zero-shot Task Generalization (2023) · 16.38
- A Study On Zero-shot Non-intrusive Speech Assessment Using Large Language Models (2024) · 5.84
- Resource-efficient Adaptation Of Speech Foundation Models For Multi-speaker ASR (2024) · 3.58
- Zero Shot Text To Speech Augmentation For Automatic Speech Recognition On Low-resource Accented Speech Corpora (2024) · 2.26
- Benchmarking Children's ASR With Supervised And Self-supervised Speech Foundation Models (2024) · 8.60