Minimally-supervised Speech Synthesis With Conditional Diffusion Model And Language Model: A Comparative Study Of Semantic Coding
2023 · Chunyu Qiang, Hao Li, Hao Ni, et al.
Abstract
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improv
Related papers
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023)
- DiffS2UT: A Semantic Preserving Diffusion Model For Textless Direct Speech-to-speech Translation (2023)
- DiffProsody: Diffusion-based Latent Prosody Generation For Expressive Speech Synthesis With Prosody Conditional Adversarial Training (2023)
- Diffusion-based Mel-spectrogram Enhancement For Personalized Speech Synthesis With Found Data (2023)
- Conditional Latent Diffusion-based Speech Enhancement Via Dual Context Learning (2025)
- DMOSpeech: Direct Metric Optimization Via Distilled Diffusion Model In Zero-shot Speech Synthesis (2024)
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022)
- DiffCSS: Diverse And Expressive Conversational Speech Synthesis With Diffusion Models (2025)