DenoiSpeech: Denoising Text To Speech With Frame-level Noise Modeling
2020 · Chen Zhang, Yi Ren, Xu Tan, et al.
Abstract
While neural text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which makes it hard to train a TTS model for that speaker. Previous works usually address this challenge in one of two ways: 1) training the TTS model on speech denoised with an enhancement model; 2) taking a single noise embedding as input when training on noisy speech. However, these methods usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that Den
Related papers
- DRSpeech: Degradation-robust Text-to-speech Synthesis With Frame-level And Utterance-level Acoustic Representation Learning (2022) · 7.50
- DiffGAN-TTS: High-fidelity And Efficient Text-to-speech With Denoising Diffusion GANs (2022) · 0.00
- Speech Denoising By Parametric Resynthesis (2019) · 7.16
- Noise Robust TTS For Low Resource Speakers Using Pre-trained Model And Speech Enhancement (2020) · 0.00
- Deep Speech Denoising With Vector Space Projections (2018) · 0.00
- NoreSpeech: Knowledge Distillation Based Conditional Diffusion Model For Noise-robust Expressive TTS (2022) · 0.00
- RNNoise-Ex: Hybrid Speech Enhancement System Based On RNN And Spectral Features (2021) · 0.00
- Adversarial Training Of Denoising Diffusion Model Using Dual Discriminators For High-fidelity Multi-speaker TTS (2023) · 2.26