When End-to-end Is Overkill: Rethinking Cascaded Speech-to-text Translation
2025 · Anna Min, Chenxu Hu, Yi Ren, et al.
Abstract
Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model, which is usually criticized for error propagation between its automatic speech recognition (ASR) and machine translation (MT) models, still has its place. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors is the increased divergence between samples that are similar in the speech domain once they are mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error propagation and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing the issues associated with them.
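As a rough illustration of what "multiple candidates from ASR" could look like at the MT input, the sketch below joins the top-k ASR hypotheses into a single source string. It is not the authors' implementation: the function name `build_mt_input`, the `<sep>` token, and the (text, score) input format are assumptions made for this example, and the self-supervised speech features mentioned in the abstract are not modeled here.

```python
from typing import List, Tuple

# Hypothetical separator token; an assumption for this sketch, not from the paper.
SEP = " <sep> "


def build_mt_input(nbest: List[Tuple[str, float]], k: int = 3) -> str:
    """Join the top-k distinct ASR hypotheses into one MT source string.

    `nbest` is a list of (hypothesis_text, score) pairs, where a higher
    score means a more likely hypothesis (for example, a log-probability).
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Rank hypotheses by score, best first. sorted() is stable, so ties
    # keep their original relative order.
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)

    # Drop empty and duplicate texts while preserving rank order.
    seen = set()
    unique = []
    for text, _score in ranked:
        norm = text.strip()
        if norm and norm not in seen:
            seen.add(norm)
            unique.append(norm)

    return SEP.join(unique[:k])


if __name__ == "__main__":
    candidates = [("i scream", -1.2), ("ice cream", -0.9), ("ice cream", -1.5)]
    print(build_mt_input(candidates, k=2))
```

With several candidates in the source, a downstream MT model sees the competing words side by side instead of a single, possibly wrong, 1-best transcript, which is the intuition the abstract describes.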
Related papers
- Tight Integrated End-to-end Training For Cascaded Speech Translation (2020)
- Cascaded Models With Cyclic Feedback For Direct Speech Translation (2020)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Speech Translation And The End-to-end Promise: Taking Stock Of Where We Are (2020)
- Harnessing Indirect Training Data For End-to-end Automatic Speech Translation: Tricks Of The Trade (2019)
- Improving Cascaded Unsupervised Speech Translation With Denoising Back-translation (2023)
- MAM: Masked Acoustic Modeling For End-to-end Speech-to-text Translation (2020)
- End-to-end Speech-to-text Translation: A Survey (2023)