Bridging The Modality Gap For Speech-to-text Translation
2020 · Yuchen Liu, Junnan Zhu, Jiajun Zhang, et al.
Abstract
End-to-end speech translation aims to translate speech in one language directly into text in another language. Most existing methods employ an encoder-decoder structure in which a single encoder learns acoustic representation and semantic information simultaneously. This ignores the differences between the speech and text modalities and overloads the encoder, making such a model very difficult to learn. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism that matches the length of the speech representation to that of the corresponding text transcription. To obtain a better semantic representation, we fully integrate a text-based translation model into STAST so that the two tasks can be trained in the same latent sp
Related papers
- Data Efficient Direct Speech-to-text Translation With Modality Agnostic Meta-learning (2019)
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019)
- Adatrans: Adapting With Boundary-based Shrinking For End-to-end Speech Translation (2022)
- Stacked Acoustic-and-textual Encoding: Integrating The Pre-trained Models Into Speech Translation Encoders (2021)
- Multilingual End-to-end Speech Translation (2019)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- STEMM: Self-learning With Speech-text Manifold Mixup For Speech Translation (2022)
- MAM: Masked Acoustic Modeling For End-to-end Speech-to-text Translation (2020)