Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training
2024 Β· Pavel Denisov, Ngoc Thang Vu
Abstract
Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation
Authors
(none)
Tags
Stats
Related papers
- Making Llms Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)7.31
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)11.23
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)0.00
- Zero-resource Speech Translation And Recognition With Llms (2024)3.58
- Training-free Multimodal Large Language Model Orchestration (2025)0.00
- X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-modalities As Foreign Languages (2023)0.00
- Uniaudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024)0.00
- Speechverse: A Large-scale Generalizable Audio Language Model (2024)0.00