Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters
2024 · Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, et al.
Abstract
Zero-shot text-to-speech (TTS) methods based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations can reproduce speaker characteristics very accurately. However, synthesis quality degrades when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method. We incorporated adapters into the SSL model, which we fine-tuned with the TTS model using noisy reference speech. In addition, to further improve performance, we adopted a speech enhancement (SE) front-end. With these improvements, our proposed SSL-based zero-shot TTS achieved high-quality speech synthesis with noisy reference speech. Through objective and subjective evaluations, we confirmed that the proposed method is highly robust to noise in reference speech and works effectively in combination with SE.
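The two ingredients named in the abstract, noisy reference speech and adapters added to the SSL model, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the residual bottleneck form of the adapter, its zero-initialised up-projection, the 768-dimensional features, the bottleneck size of 64, the 5 dB SNR, and the names `mix_at_snr` and `BottleneckAdapter`. The abstract does not state where the adapters sit or which weights are updated during fine-tuning.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean, scaled so the mixture has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain g solves clean_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


class BottleneckAdapter:
    """Residual bottleneck adapter (assumed form): x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero-initialised up-projection: the adapter starts as an identity map,
        # so the layer's output is unchanged before any training.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down + self.b_down, 0.0)
        return x + hidden @ self.w_up + self.b_up


rng = np.random.default_rng(0)

# Stand-in signals: a 1 s, 16 kHz tone in place of clean reference speech.
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)

# Stand-in for one SSL layer's output, shaped (frames, feature_dim).
features = rng.normal(size=(50, 768))
adapter = BottleneckAdapter(dim=768, bottleneck=64, rng=rng)
adapted = adapter(features)
```

In this sketch only the small `w_down` and `w_up` matrices would be new parameters, which is the usual appeal of adapters: the representation can be adjusted for noisy input with far fewer trainable weights than the full SSL model has.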
Related papers
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022)
- Noise Robust TTS For Low Resource Speakers Using Pre-trained Model And Speech Enhancement (2020)
- DINO-VITS: Data-efficient Zero-shot TTS With Self-supervised Speaker Verification Loss For Noise Robustness (2023)
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023)
- kNN Retrieval For Simple And Effective Zero-shot Multi-speaker Text-to-speech (2024)
- Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis (2022)
- An Investigation Of Noise Robustness For Flow-matching-based Zero-shot TTS (2024)