Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning
2024 · Yexing Du, Youcheng Pan, Ziyang Ma, et al.
Abstract
Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in \(15\times14\) language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.
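As a small illustration of the \(15\times14\) figure in the abstract: with 15 languages, each one can be translated into any of the other 14, giving 210 directed source-to-target pairs. The sketch below enumerates them; the labels `lang_00` to `lang_14` and the helper `directed_pairs` are placeholders for illustration, not names from the paper, and the actual language list is not given here.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Placeholder labels (assumption): the abstract does not list the 15 languages.
languages = [f"lang_{i:02d}" for i in range(15)]

pairs = directed_pairs(languages)
print(len(pairs))  # 15 * 14 = 210 translation directions
```

Ordered pairs are counted rather than unordered ones because translation from A to B and from B to A are distinct directions in a many-to-many setting.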
Related papers
- MCAT: Scaling Many-to-many Speech-to-text Translation With MLLMs To 70 Languages (2025)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024)
- Zero-resource Speech Translation And Recognition With LLMs (2024)
- One-to-many Multilingual End-to-end Speech Translation (2019)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Enhancing Code-switched Text-to-speech Synthesis Capability In Large Language Models With Only Monolingual Corpora (2024)