STEMM: Self-learning With Speech-text Manifold Mixup For Speech Translation
2022 · Qingkai Fang, Rong Ye, Lei Li, et al.
Abstract
How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
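The mixing and regularization steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the mixing granularity or the form of the regularizer, so the code assumes that speech and text representations are already aligned into corresponding units (for example, words), that each unit is taken from the text side with a hypothetical probability `p`, and that Jensen-Shannon divergence serves as the consistency term between the two output distributions. The function names `mix_sequences` and `js_divergence` are illustrative only.

```python
import numpy as np


def mix_sequences(speech_segments, text_segments, p, rng):
    """Build a mixed sequence from aligned speech/text units.

    Each argument is a list of arrays shaped (length_i, dim). For every
    aligned unit the text segment is used with probability p (an assumed
    hyperparameter), otherwise the speech segment is kept.
    """
    mixed = []
    for s_seg, t_seg in zip(speech_segments, text_segments):
        mixed.append(t_seg if rng.random() < p else s_seg)
    return np.concatenate(mixed, axis=0)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between distributions on the last axis.

    Used here as an assumed stand-in for the regularizer that ties the
    predictions from the speech-only and mixed inputs together.
    """
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


# Toy usage: two aligned units with 4-dimensional representations.
rng = np.random.default_rng(0)
speech = [rng.normal(size=(3, 4)), rng.normal(size=(5, 4))]  # frame-level
text = [rng.normal(size=(1, 4)), rng.normal(size=(2, 4))]    # token-level
mixed = mix_sequences(speech, text, p=0.5, rng=rng)

# Stand-ins for the translation model's per-step output logits.
probs_speech = softmax(rng.normal(size=(6, 10)))
probs_mixed = softmax(rng.normal(size=(6, 10)))
consistency_loss = js_divergence(probs_speech, probs_mixed).mean()
```

In a real system the two probability tensors would come from running the same translation model on the speech-only sequence and on the mixed sequence, and the consistency term would be added to the usual translation losses.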
Related papers
- Bridging The Modality Gap For Speech-to-text Translation (2020)
- Rethinking And Improving Multi-task Learning For End-to-end Speech Translation (2023)
- Data Efficient Direct Speech-to-text Translation With Modality Agnostic Meta-learning (2019)
- Multilingual End-to-end Speech Translation (2019)
- Mixspeech: Cross-modality Self-learning With Audio-visual Stream Mixup For Visual Speech Translation And Recognition (2023)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- MAM: Masked Acoustic Modeling For End-to-end Speech-to-text Translation (2020)
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022)