Textless Unit-to-unit Training For Many-to-many Multilingual Speech-to-speech Translation
2023 · Minsu Kim, Jeongsoo Choi, Dahun Kim, et al.
Abstract
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units, which are discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both the speech and text modalities at the level of phonetic information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the ta
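As a rough illustration of treating speech units as pseudo-text, the sketch below turns a sequence of integer unit IDs into token sequences with a language token in front. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `<en>`-style token format, and the merging of consecutive repeated units are all hypothetical choices made for this example, and the decoder side is assumed to receive a language token in the same way as the encoder.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive unit IDs into a single ID.

    Assumption: repeated units are merged before translation. This is a
    common preprocessing step in unit-based systems, but the abstract
    does not state it.
    """
    return [unit for unit, _ in groupby(units)]


def build_encoder_input(units, src_lang):
    """Prepend a source-language token to the unit sequence (hypothetical format)."""
    return [f"<{src_lang}>"] + [str(u) for u in collapse_repeats(units)]


def build_decoder_input(units, tgt_lang):
    """Prepend a language token to the target unit sequence (hypothetical format)."""
    return [f"<{tgt_lang}>"] + [str(u) for u in collapse_repeats(units)]


if __name__ == "__main__":
    # Toy unit IDs standing in for the output of a speech quantizer.
    source_units = [5, 5, 5, 12, 12, 7, 5, 5]
    target_units = [3, 3, 9, 9, 9, 4]
    print(build_encoder_input(source_units, "en"))
    print(build_decoder_input(target_units, "es"))
```

Because the language is carried by a token rather than by a separate model, a single encoder-decoder can in principle be trained on every language pair in the data, which is what the many-to-many setting requires.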
Related papers
- SpeechUT: Bridging Speech And Text With Hidden-unit For Encoder-decoder Based Speech-text Pre-training (2022)
- Direct Speech-to-speech Translation With Discrete Units (2021)
- Learning To Speak From Text: Zero-shot Multilingual Text-to-speech With Unsupervised Text Pretraining (2023)
- Analyzing Speech Unit Selection For Textless Speech-to-speech Translation (2024)
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022)
- One-to-many Multilingual End-to-end Speech Translation (2019)
- UnitY: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022)
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022)