ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
2023 · Chenyang Le, Yao Qian, Long Zhou, et al.
Abstract
Joint speech-language training is challenging because of its large demand for training data and GPU resources, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized in a data-efficient manner for spoken language tasks. In particular, we propose to incorporate cross-modality learning into transfer learning and to perform both simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech-to-English-text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
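As a rough illustration of what training "in a multi-task learning manner" can look like, the sketch below folds several per-task losses into one weighted objective. The task names (`st`, `asr`, `mt`, `cross_modal`), the weights, and the loss values are all hypothetical and chosen only for illustration; the abstract does not specify which losses ComSL combines or how they are weighted.

```python
from typing import Dict


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of per-task losses.

    Tasks without an explicit weight default to a weight of 1.0.
    """
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-task losses for a single training step (not from the paper).
step_losses = {"st": 2.0, "asr": 1.0, "mt": 0.5, "cross_modal": 0.25}

# Hypothetical task weights (assumption; the abstract gives none).
task_weights = {"st": 1.0, "asr": 0.5, "mt": 0.5, "cross_modal": 2.0}

total = combine_losses(step_losses, task_weights)
print(total)  # 2.0*1.0 + 1.0*0.5 + 0.5*0.5 + 0.25*2.0 = 3.25
```

In an actual training loop the values would be differentiable tensors rather than floats, and the single combined loss would be backpropagated so that all tasks update the shared model in the same step.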
Related papers
- One-to-many Multilingual End-to-end Speech Translation (2019)
- SLAM: A Unified Encoder For Speech And Language Modeling Via Speech-text Joint Pre-training (2021)
- Can We Achieve High-quality Direct Speech-to-speech Translation Without Parallel Speech Data? (2024)
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021)
- LLaST: Improved End-to-end Speech Translation System Leveraged By Large Language Models (2024)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)