Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation
2022 · Sravya Popuri, Peng-Jen Chen, Changhan Wang, et al.
Abstract
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and transfer pre-training and efficient partial finetuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning with an average 6.6-12.1 BLEU gain, and it ca
Related papers
- Direct Speech-to-speech Translation With Discrete Units (2021) 13.97
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022) 8.35
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022) 7.81
- Preserving Speaker Information In Direct Speech-to-speech Translation With Non-autoregressive Generation And Pretraining (2024) 0.00
- Leveraging Pseudo-labeled Data To Improve Direct Speech-to-speech Translation (2022) 10.33
- Textless Speech-to-speech Translation On Real Data (2021) 13.65
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018) 13.05
- Unity: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022) 9.59