Direct Speech-to-speech Translation With A Sequence-to-sequence Model
2019 · Ye Jia, Ron J. Weiss, Fadi Biadsy, et al.
Abstract
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.
Related papers
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021)
- End-to-end Spoken Language Translation (2019)
- Direct Speech-to-speech Translation With Discrete Units (2021)
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022)
- Translatotron 3: Speech To Speech Translation With Monolingual Data (2023)
- Towards Better Decoding And Language Model Integration In Sequence To Sequence Models (2016)
- Tight Integrated End-to-end Training For Cascaded Speech Translation (2020)
- Long-form End-to-end Speech Translation Via Latent Alignment Segmentation (2023)