Revisiting Direct Speech-to-text Translation With Speech Llms: Better Scaling Than Cot Prompting?
2025 · Oriol Pareras, Gerard I. Gállego, Federico Costa, et al.
Abstract
Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
Authors
(none)
Tags
Stats
Related papers
- Chain-of-thought Prompting For Speech Translation (2024)6.34
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025)0.00
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)8.35
- Effective Text Adaptation For Llm-based ASR Through Soft Prompt Fine-tuning (2024)5.84
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)0.00
- Making Llms Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)7.31
- MCAT: Scaling Many-to-many Speech-to-text Translation With Mllms To 70 Languages (2025)2.41
- Simuls2s-llm: Unlocking Simultaneous Inference Of Speech Llms For Speech-to-speech Translation (2025)3.58