Direct Speech-to-speech Translation With Discrete Units
2021 · Ann Lee, Peng-Jen Chen, Changhan Wang, et al.
Abstract
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the
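The abstract describes turning target speech into discrete representations that a sequence-to-sequence model then predicts. It does not say how the units are obtained, so the following is only a minimal sketch under stated assumptions: each frame of continuous features from some self-supervised encoder is assigned to its nearest centroid in a fixed codebook, and consecutive repeats are optionally collapsed. The function names, the toy centroids, and the repeat-collapsing step are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def nearest_unit(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to one feature frame (squared L2)."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def speech_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every feature frame to a discrete unit ID (one ID per frame)."""
    return [nearest_unit(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units into a single unit (assumed step)."""
    reduced: List[int] = []
    for unit in units:
        if not reduced or reduced[-1] != unit:
            reduced.append(unit)
    return reduced


# Toy data: a 3-entry codebook and 5 two-dimensional "frames" (illustrative only).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 4.1], [4.0, 4.0]]

units = speech_to_units(frames, centroids)   # frame-level unit IDs
reduced = collapse_repeats(units)            # run-length collapsed unit sequence
print(units, reduced)
```

In a setup like this, the unit sequence would serve as the prediction target of the speech-to-unit translation model in place of spectrogram frames; a separate component would still be needed to turn predicted units back into a waveform.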
Related papers
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022)
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022)
- UnitY: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022)
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022)
- Textless Speech-to-speech Translation On Real Data (2021)
- Preserving Speaker Information In Direct Speech-to-speech Translation With Non-autoregressive Generation And Pretraining (2024)
- TranSpeech: Speech-to-speech Translation With Bilateral Perturbation (2022)
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023)