Preserving Speaker Information In Direct Speech-to-speech Translation With Non-autoregressive Generation And Pretraining
2024 · Rui Zhou, Akinori Ito, Takashi Nose
Abstract
Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fu
Related papers
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022)
- Direct Speech-to-speech Translation With Discrete Units (2021)
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022)
- UnitY: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022)
- TranSpeech: Speech-to-speech Translation With Bilateral Perturbation (2022)
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022)
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023)