SpeechVerse: A Large-scale Generalizable Audio Language Model
2024 · Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, et al.
Abstract
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines acros
Related papers
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- A Survey On Speech Large Language Models For Understanding (2024)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024)
- DiscreteSLU: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)
- Prompting Large Language Models With Audio For General-purpose Speech Summarization (2024)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- Roadmap Towards Superhuman Speech Understanding Using Large Language Models (2024)