DiffS2UT: A Semantic Preserving Diffusion Model For Textless Direct Speech-to-speech Translation
2023 · Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, et al.
Abstract
While diffusion generative models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation, especially translation tasks, remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is not optimal to brutally apply discrete diffusion on the speech unit sequence while disregarding the continuous space structure, which will degrade the generation performance significantly. In this paper, we propose a novel diffusion model by applying the diffusion forward process in the continuous speech representation space, while employing the diffusion backward process in the discrete speech unit space. In this way, we preserve the semantic structure of the continuous speech representation sp
Related papers
- Minimally-supervised Speech Synthesis With Conditional Diffusion Model And Language Model: A Comparative Study Of Semantic Coding (2023) · 8.82
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023) · 5.24
- Diffusion Synthesizer For Efficient Multilingual Speech To Speech Translation (2024) · 0.00
- DCTTS: Discrete Diffusion Model With Contrastive Learning For Text-to-speech Generation (2023) · 5.72
- Investigating The Design Space Of Diffusion Models For Speech Enhancement (2023) · 10.07
- Language Translation, And Change Of Accent For Speech-to-speech Task Using Diffusion Model (2025) · 0.00
- DiffAR: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation (2023) · 0.00
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) · 14.35