SALM: Speech-augmented Language Model With In-context Learning For Speech Recognition And Translation
2023 · Zhehuai Chen, He Huang, Andrei Andrusenko, et al.
Abstract
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
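The component layout described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the NeMo implementation: the modality adapter is assumed here to be a single linear projection, the speech embeddings are assumed to be placed before the instruction embeddings, and all names (`LoRALinear`, `build_llm_input`, the toy dimensions) are hypothetical. It shows only two ideas: audio features are mapped into the LLM embedding space and joined with the text prompt, and a frozen weight is augmented by a trainable low-rank (LoRA) update.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LoRALinear:
    """Frozen weight W plus a low-rank update B @ A (hypothetical toy layer)."""
    def __init__(self, d_in, d_out, rank, rng, alpha=1.0):
        self.W = rand_matrix(d_out, d_in, rng)          # frozen LLM weight
        self.A = rand_matrix(rank, d_in, rng)           # trainable down-projection
        self.B = [[0.0] * rank for _ in range(d_out)]   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

def build_llm_input(audio_feats, adapter, instruction_embs):
    """Project audio frames into the LLM embedding space, then join them
    with the instruction embeddings (ordering is an assumption)."""
    speech_embs = [matvec(adapter, frame) for frame in audio_feats]
    return speech_embs + instruction_embs

rng = random.Random(0)
AUDIO_DIM, LLM_DIM, RANK = 4, 6, 2   # toy sizes, chosen for illustration

adapter = rand_matrix(LLM_DIM, AUDIO_DIM, rng)   # stand-in for the modality adapter
audio_feats = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(3)]      # 3 encoder frames
instruction_embs = [[rng.random() for _ in range(LLM_DIM)] for _ in range(2)]   # 2 prompt tokens

llm_input = build_llm_input(audio_feats, adapter, instruction_embs)
layer = LoRALinear(LLM_DIM, LLM_DIM, RANK, rng)
hidden = [layer(tok) for tok in llm_input]
```

Because `B` starts at zero, the LoRA layer initially reproduces the frozen weight exactly, so training begins from the unmodified LLM and only the adapter and low-rank matrices need to be updated.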
Related papers
- SELMA: A Speech-enabled Language Model For Virtual Assistant Interactions (2025)
- End-to-end Speech Recognition Contextualization With Large Language Models (2023)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023)
- SALMONN-omni: A Codec-free LLM For Full-duplex Speech Understanding And Generation (2024)
- CALM: Contrastive Aligned Audio-language Multirate And Multimodal Representations (2022)
- Attention-based Contextual Language Model Adaptation For Speech Recognition (2021)
- MATS: An Audio Language Model Under Text-only Supervision (2025)