MCAT: Scaling Many-to-many Speech-to-text Translation With MLLMs To 70 Languages
2025 · Yexing Du, Kaiyuan Liu, Youcheng Pan, et al.
Abstract
Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens.
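The abstract does not say how the speech adapter shortens the sequence, so the following is only an illustration of the scale involved: going from 750 speech tokens to 30 is a 25x reduction. The sketch below uses plain mean pooling over fixed windows, which is an assumption chosen for simplicity and not the MCAT adapter itself; `mean_pool`, the 4-dimensional features, and the synthetic input are all hypothetical.

```python
from typing import List


def mean_pool(frames: List[List[float]], target_len: int) -> List[List[float]]:
    """Shorten a sequence of feature vectors by averaging equal-sized windows.

    Illustrative stand-in only: the paper's adapter design is not described
    in the abstract, and mean pooling is an assumption made for this sketch.
    """
    if target_len <= 0 or not frames or len(frames) % target_len != 0:
        raise ValueError("sequence length must be a positive multiple of target_len")
    window = len(frames) // target_len  # number of frames merged into one token
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Average each feature dimension across the frames in this window.
        pooled.append([sum(vec[d] for vec in chunk) / window for d in range(dim)])
    return pooled


# Hypothetical input: 750 encoder frames with 4-dimensional features.
frames = [[float(i)] * 4 for i in range(750)]
tokens = mean_pool(frames, target_len=30)
print(len(frames), "->", len(tokens), "tokens; compression factor", len(frames) // len(tokens))
```

Because the cost of LLM inference grows with the number of input tokens, a shorter speech sequence of this kind is what the abstract's efficiency claim rests on; a learned adapter would replace the fixed averaging used here.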
Related papers
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024) 7.31
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) 0.00
- Enhancing Code-switched Text-to-speech Synthesis Capability In Large Language Models With Only Monolingual Corpora (2024) 0.00
- Mixture-of-transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024) 0.00
- Hearing To Translate: The Effectiveness Of Speech Modality Integration Into LLMs (2026) 0.00
- Zero-resource Speech Translation And Recognition With LLMs (2024) 3.58
- MCIF: Multimodal Crosslingual Instruction-following Benchmark From Scientific Talks (2025) 0.00
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024) 0.00