DRSpeech: Degradation-robust Text-to-speech Synthesis With Frame-level And Utterance-level Acoustic Representation Learning
2022 · Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto
Abstract
Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trained on speech corpora that contain both additive noises and environmental distortions. It jointly represents the time-variant additive noises with a frame-level encoder and the time-invariant environmental distortions with an utterance-level encoder. We also propose a regularization method to attain clean environmental embedding that is disentangled from the utterance-dependent information such as linguistic contents and speaker characteristics. Evaluation results show that our method achieved significantly higher-quality synthetic speech than previous methods in the condition including both a
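The split between a frame-level and an utterance-level representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture, which the abstract does not specify: the linear projections, the mean pooling, the concatenation-based conditioning, and every name below (`project`, `frame_level_encode`, `utterance_level_encode`, `condition`) are assumptions chosen only to show the difference between one embedding per frame (time-variant) and one embedding per utterance (time-invariant).

```python
from typing import List

Vector = List[float]


def project(frame: Vector, weights: List[Vector]) -> Vector:
    """Apply a linear map with one weight row per output dimension."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def frame_level_encode(frames: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Hypothetical frame-level encoder: one embedding per frame,
    so it can follow noise that changes over time."""
    return [project(f, weights) for f in frames]


def utterance_level_encode(frames: List[Vector], weights: List[Vector]) -> Vector:
    """Hypothetical utterance-level encoder: project every frame, then
    average over time to get a single time-invariant embedding."""
    projected = [project(f, weights) for f in frames]
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]


def condition(text_hidden: List[Vector],
              frame_emb: List[Vector],
              utt_emb: Vector) -> List[Vector]:
    """Assumed conditioning scheme: concatenate both embeddings onto each
    time step; the utterance embedding is repeated for every frame."""
    return [h + f + utt_emb for h, f in zip(text_hidden, frame_emb)]


# Toy input: three frames of two-dimensional acoustic features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_frame = [[1.0, -1.0]]   # illustrative weights, not learned
w_utt = [[0.5, 0.5]]      # illustrative weights, not learned

frame_emb = frame_level_encode(frames, w_frame)   # one vector per frame
utt_emb = utterance_level_encode(frames, w_utt)   # one vector per utterance

text_hidden = [[0.1], [0.2], [0.3]]               # stand-in for text-side features
conditioned = condition(text_hidden, frame_emb, utt_emb)
```

In this sketch the frame-level embedding differs from frame to frame, while the last component of every conditioned vector is the same utterance-level value. In a trained system both encoders would be learned networks rather than fixed projections.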
Related papers
- DenoiSpeech: Denoising Text To Speech With Frame-level Noise Modeling (2020) · 0.00
- Toward Degradation-robust Voice Conversion (2021) · 5.84
- Environment Aware Text-to-speech Synthesis (2021) · 6.34
- Noise Robust TTS For Low Resource Speakers Using Pre-trained Model And Speech Enhancement (2020) · 0.00
- TTS-by-TTS: TTS-driven Data Augmentation For Fast And High-quality Speech Synthesis (2020) · 9.59
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020) · 5.24
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021) · 9.23
- Expressive TTS Training With Frame And Style Reconstruction Loss (2020) · 12.74