DASpeech: Directed Acyclic Transformer For Fast And High-quality Speech-to-speech Translation
2023 · Qingkai Fang, Yan Zhou, Yang Feng
Abstract
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To co
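The abstract states that DA-Transformer models translations with a directed acyclic graph. As a rough illustration of what that can mean, the sketch below scores a target token sequence by summing over all forward paths through a toy DAG with dynamic programming. This is a minimal sketch, not the authors' implementation: the function name `dag_sequence_prob`, the `emit` and `trans` tables, and all numbers are hypothetical, and it assumes that each vertex emits exactly one token and that every path starts at the first vertex and ends at the last one.

```python
def dag_sequence_prob(emit, trans, target):
    """Probability of `target`, summed over all paths in a toy DAG.

    emit[u][tok] : probability that vertex u emits token tok (hypothetical table)
    trans[u][v]  : probability of moving from vertex u to a later vertex v
    Assumption: paths start at vertex 0 and end at the last vertex.
    """
    n_vertices = len(emit)
    # f[u] = probability of having emitted the target prefix so far
    #        along some path that currently ends at vertex u
    f = [0.0] * n_vertices
    f[0] = emit[0].get(target[0], 0.0)
    for tok in target[1:]:
        g = [0.0] * n_vertices
        for v in range(1, n_vertices):
            # Edges only go forward (u < v), which keeps the graph acyclic.
            incoming = sum(f[u] * trans[u].get(v, 0.0) for u in range(v))
            g[v] = incoming * emit[v].get(tok, 0.0)
        f = g
    return f[n_vertices - 1]


# Toy graph with four vertices; all values are made up for illustration.
emit = [{"a": 1.0}, {"b": 0.5, "c": 0.5}, {"b": 1.0}, {"c": 1.0}]
trans = [{1: 0.5, 2: 0.5}, {2: 0.5, 3: 0.5}, {3: 1.0}, {}]

print(dag_sequence_prob(emit, trans, ["a", "b", "c"]))
```

In this toy graph, the sequence `a b c` can be produced along two paths (through vertex 1 or through vertex 2), and the function adds their probabilities. Summing over paths in this way is one way a single model can assign probability to several alternative realizations of the same output.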
Related papers
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022) · 9.76
- UnitY: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022) · 9.59
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) · 0.00
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021) · 0.00
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019) · 10.48
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022) · 7.81
- Direct Speech-to-speech Translation With Discrete Units (2021) · 13.97
- Preserving Speaker Information In Direct Speech-to-speech Translation With Non-autoregressive Generation And Pretraining (2024) · 0.00