Latent Filling: Latent Space Data Augmentation For Zero-shot Speech Synthesis
2023 · Jae-Sung Bae, Joun Yeop Lee, Ji-Hyun Lee, et al.
Abstract
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to improve their systems by enlarging the training data through crowd-sourcing or by augmenting existing speech data. However, the use of low-quality data has led to a decline in overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that applies simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
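The abstract does not say which augmentation operations LF uses or how the consistency loss is defined, so the following is only a minimal sketch under stated assumptions: augmentation is taken to be interpolation between two speaker embeddings followed by Gaussian noise, and the consistency loss is taken to be one minus the cosine similarity between the augmented embedding and an embedding re-extracted from the synthesized speech. The names `latent_fill`, `cosine_consistency_loss`, and `noise_std` are hypothetical and do not come from the paper.

```python
import math
import random


def latent_fill(emb_a, emb_b, rng, noise_std=0.05):
    """Assumed augmentation: mix two speaker embeddings, then perturb the result."""
    alpha = rng.random()  # interpolation weight drawn from [0, 1)
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(emb_a, emb_b)]
    # Add small Gaussian noise to each dimension (noise_std is an assumed value).
    return [m + rng.gauss(0.0, noise_std) for m in mixed]


def cosine_consistency_loss(target_emb, generated_emb):
    """Assumed loss: 1 - cosine similarity between the two embeddings."""
    dot = sum(t * g for t, g in zip(target_emb, generated_emb))
    norm_t = math.sqrt(sum(t * t for t in target_emb))
    norm_g = math.sqrt(sum(g * g for g in generated_emb))
    return 1.0 - dot / (norm_t * norm_g)


rng = random.Random(0)
spk_a = [1.0, 0.0, 0.0, 0.0]  # toy embedding for speaker A
spk_b = [0.0, 1.0, 0.0, 0.0]  # toy embedding for speaker B

augmented = latent_fill(spk_a, spk_b, rng)

# A perfectly consistent output reproduces the augmented embedding exactly.
loss_same = cosine_consistency_loss(augmented, augmented)
# An output resembling an unrelated speaker direction is penalized.
loss_other = cosine_consistency_loss(augmented, [0.0, 0.0, 1.0, 0.0])
```

In a real system the second argument to the loss would come from a speaker encoder applied to the generated speech, and both terms would be differentiable tensors; the plain lists here only illustrate the data flow.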
Related papers
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer For Zero-shot Speech Synthesis (2025)
- Improving End-to-end Speech Processing By Efficient Text Data Utilization With Latent Synthesis (2023)
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018)
- Zero Shot Text To Speech Augmentation For Automatic Speech Recognition On Low-resource Accented Speech Corpora (2024)
- Low-data? No Problem: Low-resource, Language-agnostic Conversational Text-to-speech Via F0-conditioned Data Augmentation (2022)
- HAM-TTS: Hierarchical Acoustic Modeling For Token-based Zero-shot Text-to-speech With Model And Data Scaling (2024)
- Adversarial Speaker-consistency Learning Using Untranscribed Speech Data For Zero-shot Multi-speaker Text-to-speech (2022)
- Generative Data Augmentation Challenge: Zero-shot Speech Synthesis For Personalized Speech Enhancement (2025)