Diffusion-based Mel-spectrogram Enhancement For Personalized Speech Synthesis With Found Data
2023 · Yusheng Tian, Wei Liu, Tan Lee
Abstract
Creating synthetic voices with found data is challenging, as real-world recordings often contain various types of audio degradation. One way to address this problem is to pre-enhance the speech with an enhancement model and then use the enhanced data for text-to-speech (TTS) model training. This paper investigates the use of conditional diffusion models for generalized speech enhancement, which aims to address multiple types of audio degradation simultaneously. The enhancement is performed in the log Mel-spectrogram domain to align with the TTS training objective. Text information is introduced as an additional condition to improve model robustness. Experiments on real-world recordings demonstrate that a synthetic voice built on data enhanced by the proposed model produces higher-quality synthetic speech than voices built on data enhanced by strong baselines. Code and pre-trained parameters of the proposed enhancement model are available at https://github.com/dmse4tts
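To make the phrase "diffusion in the log Mel-spectrogram domain" concrete, the following is a minimal sketch of the forward (noising) step of a generic DDPM-style diffusion applied to a log Mel-spectrogram. It is not the authors' implementation: the linear beta schedule, the function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`), and the toy input are all assumptions made for illustration. The conditional denoising network, which in the paper is conditioned on the degraded spectrogram and on text, is omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(log_mel, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    log_mel is a list of frames, each a list of log-Mel bin values.
    """
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in frame]
            for frame in log_mel]


# Toy "clean" log Mel-spectrogram: 4 frames x 3 Mel bins (hypothetical values).
clean_log_mel = [[-2.0, -1.5, -3.0],
                 [-1.8, -1.2, -2.7],
                 [-2.2, -1.9, -3.1],
                 [-2.5, -2.0, -3.3]]

alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
slightly_noisy = add_noise(clean_log_mel, 0, alpha_bars, rng)    # almost clean
almost_pure_noise = add_noise(clean_log_mel, 999, alpha_bars, rng)
```

In a setup like this, an enhancement model would be trained to reverse the noising process, starting from noise and recovering a clean log Mel-spectrogram while being guided by the conditioning inputs. Operating directly on log Mel-spectrograms means the enhanced output is already in the feature space a TTS acoustic model is typically trained to predict.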
Related papers
- Minimally-supervised Speech Synthesis With Conditional Diffusion Model And Language Model: A Comparative Study Of Semantic Coding (2023) 8.82
- FastDiff: A Fast Conditional Diffusion Model For High-quality Speech Synthesis (2022) 14.35
- Cold Diffusion For Speech Enhancement (2022) 11.85
- Diffusion-based Speech Enhancement With A Weighted Generative-supervised Learning Loss (2023) 0.00
- Creating Personalized Synthetic Voices From Post-glossectomy Speech With Guided Diffusion Models (2023) 3.58
- High-fidelity Speech Synthesis With Minimal Supervision: All Using Diffusion Models (2023) 5.24
- DiffProsody: Diffusion-based Latent Prosody Generation For Expressive Speech Synthesis With Prosody Conditional Adversarial Training (2023) 10.07
- ProDiff: Progressive Fast Diffusion Model For High-quality Text-to-speech (2022) 0.00