An Investigation Of Noise Robustness For Flow-matching-based Zero-shot TTS
2024 · Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, et al.
Abstract
Zero-shot text-to-speech (TTS) systems, which can synthesize any speaker's voice from a short audio prompt, have advanced rapidly in recent years. However, the quality of the generated speech deteriorates significantly when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explore strategies for improving the quality of audio generated from noisy audio prompts in flow-matching-based zero-shot TTS. Our investigation covers three training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. Our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to applying speech enhancement to the audio prompt.
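To make the "random noise mixing" idea concrete, the following is a minimal sketch of how a clean utterance might be corrupted at a randomly drawn signal-to-noise ratio during fine-tuning. The abstract does not give the mixing details, so the function names (`mix_at_snr`, `random_noise_mix`), the SNR range, and the mixing probability are all illustrative assumptions rather than the authors' settings.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB)."""
    # Tile, then crop, the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        return speech.copy()  # silent input: no meaningful SNR to target

    # Pick gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def random_noise_mix(speech, noise, rng, snr_range=(0.0, 20.0), mix_prob=0.5):
    """With probability mix_prob, add noise at an SNR drawn uniformly from snr_range.

    snr_range and mix_prob are hypothetical defaults, not values from the paper.
    """
    if rng.random() >= mix_prob:
        return speech.copy()  # leave this example clean
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db)
```

In a training loop, a function like `random_noise_mix` would typically be applied on the fly to each sampled utterance, with `rng = np.random.default_rng(seed)` and a noise clip drawn from a separate noise corpus, so the model sees both clean and noisy prompts.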
Related papers
- DINO-VITS: Data-efficient Zero-shot TTS With Self-supervised Speaker Verification Loss For Noise Robustness (2023)
- DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching (2025)
- Time-layer Adaptive Alignment For Speaker Similarity In Flow-matching Based Zero-shot TTS (2025)
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters (2024)
- VoicePrompter: Robust Zero-shot Voice Conversion With Voice Prompt And Conditional Flow Matching (2025)
- Noise Robust TTS For Low Resource Speakers Using Pre-trained Model And Speech Enhancement (2020)
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)