MAM: Masked Acoustic Modeling For End-to-end Speech-to-text Translation
2020 · Junkun Chen, Mingbo Ma, Renjie Zheng, et al.
Abstract
End-to-end Speech-to-text Translation (E2E-ST), which directly translates source-language speech to target-language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions depend heavily on source-language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion only on the speech side, which can utilize speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving E2E-ST but can also perform pre-training on any acoustic signals (including non-speech ones) without annotation. We conduct our experiments over 8 different translation directions. In the setting without using any transcriptions, our technique achieves an average improvement of
Related papers
- Bridging The Modality Gap For Speech-to-text Translation (2020)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- When End-to-end Is Overkill: Rethinking Cascaded Speech-to-text Translation (2025)
- Stacked Acoustic-and-textual Encoding: Integrating The Pre-trained Models Into Speech Translation Encoders (2021)
- Data Efficient Direct Speech-to-text Translation With Modality Agnostic Meta-learning (2019)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- Harnessing Indirect Training Data For End-to-end Automatic Speech Translation: Tricks Of The Trade (2019)
- Tight Integrated End-to-end Training For Cascaded Speech Translation (2020)