SimpleSpeech: Towards Simple And Efficient Text-to-speech With Scalar Latent Transformer Diffusion Models
2024 · Dongchao Yang, Dingdong Wang, Haohan Guo, et al.
Abstract
In this study, we propose SimpleSpeech, a simple and efficient non-autoregressive (NAR) text-to-speech (TTS) system based on diffusion. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignment information; (2) it takes plain text directly as input and generates speech in an NAR manner; (3) it aims to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model with scalar quantization (SQ-Codec), which effectively maps the complex speech signal into a finite and compact latent space, named the scalar latent space. Building on SQ-Codec, we apply a novel transformer diffusion model in its scalar latent space. We train SimpleSpeech on 4k hours of speech-only data, and it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and gener
Related papers
- NaturalSpeech 2: Latent Diffusion Models Are Natural And Zero-shot Speech And Singing Synthesizers (2023) · 0.00
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023) · 5.24
- NaturalSpeech 3: Zero-shot Speech Synthesis With Factorized Codec And Diffusion Models (2024) · 0.00
- DCTTS: Discrete Diffusion Model With Contrastive Learning For Text-to-speech Generation (2023) · 5.72
- Diffusion Synthesizer For Efficient Multilingual Speech To Speech Translation (2024) · 0.00
- DiTTo-TTS: Diffusion Transformers For Scalable Text-to-speech Without Domain-specific Factors (2024) · 0.00
- Schrodinger Bridges Beat Diffusion Models On Text-to-speech Synthesis (2023) · 0.00
- Minimally-supervised Speech Synthesis With Conditional Diffusion Model And Language Model: A Comparative Study Of Semantic Coding (2023) · 8.82