Hearing To Translate: The Effectiveness Of Speech Modality Integration Into LLMs
2026 · Sara Papi, Javier Garcia Gilabert, Zachary Hopton, et al.
Abstract
arXiv:2512.16378v4
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which process spoken language directly and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but the most recent SpeechLLMs can match or even outperf
Related papers
- Blending LLMs Into Cascaded Speech Translation: KIT's Offline Speech Translation System For IWSLT 2024 (2024) 0.00
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024) 0.00
- Recent Advances In Speech Language Models: A Survey (2024) 14.64
- Zero-resource Speech Translation And Recognition With LLMs (2024) 3.58
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023) 8.09
- LLaST: Improved End-to-end Speech Translation System Leveraged By Large Language Models (2024) 10.67
- Ideal-LLM: Integrating Dual Encoders And Language-adapted LLM For Multilingual Speech-to-text (2024) 5.24
- Towards Achieving Human Parity On End-to-end Simultaneous Speech Translation Via LLM Agent (2024) 0.00