Can We Achieve High-quality Direct Speech-to-speech Translation Without Parallel Speech Data?
2024 · Qingkai Fang, Shaolei Zhang, Zhengrui Ma, et al.
Abstract
Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a
Related papers
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022)
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022)
- Textless Speech-to-speech Translation With Limited Parallel Data (2023)
- StyleS2ST: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023)
- Direct Speech-to-speech Translation With Discrete Units (2021)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022)