Transformer-based Cascaded Multimodal Speech Translation
2019 · Zixiu Wu, Ozan Caglayan, Julia Ive, et al.
Abstract
This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across experiments, the MMT model varies in how it integrates the visual context (simple conditioning vs. attention), the type of visual features it exploits (pooled, convolutional, action categories), and the underlying architecture. For the latter, we explore both the canonical Transformer and its deliberation version, with additive and cascade variants that differ in how they integrate the textual attention. After extensive experiments, we found that (i) the explored visual integration schemes often harm translation performance for the Transformer and additive deliberation, but considerably improve the cascade deliberation;
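The cascade described in the abstract (ASR first, then multimodal translation of the transcript) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `CascadedTranslator`, the callables `toy_asr` and `toy_mmt`, and the list-of-floats representation of audio and visual features are all hypothetical placeholders chosen only to show the data flow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stand-in types; the real systems operate on audio
# signals and on visual feature tensors.
Audio = Sequence[float]
VisualFeatures = Sequence[float]


@dataclass
class CascadedTranslator:
    """Two-stage pipeline: speech -> transcript -> translation."""

    asr: Callable[[Audio], str]                 # speech recognizer
    mmt: Callable[[str, VisualFeatures], str]   # multimodal translator

    def translate(self, audio: Audio, visual: VisualFeatures) -> str:
        # Stage 1: the ASR output is plain text, so the translation
        # stage never sees the audio itself.
        transcript = self.asr(audio)
        # Stage 2: the translator receives the transcript together
        # with the visual context.
        return self.mmt(transcript, visual)


# Toy components (assumptions for illustration only).
def toy_asr(audio: Audio) -> str:
    return "a man is cooking"


def toy_mmt(transcript: str, visual: VisualFeatures) -> str:
    return f"<translated:{transcript}|vis={len(visual)}>"


system = CascadedTranslator(asr=toy_asr, mmt=toy_mmt)
result = system.translate([0.0, 0.1], [0.5, 0.2, 0.9])
print(result)
```

Because the two stages communicate only through the transcript, the ASR component can stay fixed while the translation component is swapped out, which matches the experimental setup the abstract describes.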
Related papers
- Dual-decoder Transformer For Joint Automatic Speech Recognition And Multilingual Speech Translation (2020)
- Tight Integrated End-to-end Training For Cascaded Speech Translation (2020)
- Blending LLMs Into Cascaded Speech Translation: KIT's Offline Speech Translation System For IWSLT 2024 (2024)
- Cascaded Models With Cyclic Feedback For Direct Speech Translation (2020)
- When End-to-end Is Overkill: Rethinking Cascaded Speech-to-text Translation (2025)
- Improving Cascaded Unsupervised Speech Translation With Denoising Back-translation (2023)
- Multimodal Machine Translation Through Visuals And Speech (2019)
- Hearing To Translate: The Effectiveness Of Speech Modality Integration Into LLMs (2026)