Tight Integrated End-to-end Training For Cascaded Speech Translation
2020 · Parnia Bahar, Tobias Bieschke, Ralf Schlüter, et al.
Abstract
A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative that avoids error propagation; however, its performance often lags behind that of the cascade system. To use an intermediate representation while preserving end-to-end trainability, previous studies have proposed two-stage models that pass the hidden vectors of the recognizer into the decoder of the MT model and ignore the MT encoder. This work explores the feasibility of collapsing all cascade components into a single end-to-end trainable model by optimizing all parameters of the ASR and MT models jointly, without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft de
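As a rough illustration of the soft-decision idea described in the abstract, the sketch below renormalizes a source word posterior and feeds it to a translation-side embedding table as a probability-weighted mix, instead of committing to a single transcribed word. The exponent `gamma`, the function names, and the toy embedding table are assumptions made for illustration; the abstract does not specify how the renormalization is done.

```python
def renormalize(posterior, gamma=1.0):
    """Raise each probability to the power gamma, then rescale to sum to 1.

    gamma > 1 sharpens the distribution toward its most likely word,
    gamma < 1 flattens it. (gamma is a hypothetical knob, not from the abstract.)
    """
    powered = [p ** gamma for p in posterior]
    total = sum(powered)
    return [p / total for p in powered]


def soft_embedding(posterior, embedding_table):
    """Expected embedding under the posterior: sum_i p_i * E[i].

    A one-hot posterior reduces to an ordinary embedding lookup, which is
    the hard decision a conventional cascade would make.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(posterior, embedding_table))
        for d in range(dim)
    ]


# Toy example: a 3-word source vocabulary with 2-dimensional embeddings.
asr_posterior = [0.5, 0.3, 0.2]
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

sharpened = renormalize(asr_posterior, gamma=2.0)
mt_input = soft_embedding(sharpened, table)
```

Because both steps are differentiable in the posterior, a loss computed on the translation output could in principle propagate gradients back into the recognizer, which is the property a discrete transcript lacks.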
Related papers
- When End-to-end Is Overkill: Rethinking Cascaded Speech-to-text Translation (2025)
- Cascaded Models With Cyclic Feedback For Direct Speech Translation (2020)
- Improving Cascaded Unsupervised Speech Translation With Denoising Back-translation (2023)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019)
- Transformer-based Cascaded Multimodal Speech Translation (2019)
- Stacked Acoustic-and-textual Encoding: Integrating The Pre-trained Models Into Speech Translation Encoders (2021)
- Textless Streaming Speech-to-speech Translation Using Semantic Speech Tokens (2024)