Automatic Tuning Of Loss Trade-offs Without Hyper-parameter Search In End-to-end Zero-shot Speech Synthesis
2023 · Seongyeon Park, Bohyung Kim, Tae-Hyun Oh
Abstract
Recently, zero-shot text-to-speech (TTS) and voice conversion (VC) methods have gained attention for their practicality: they can generate voices that were unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while inheriting useful properties from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This is problematic because finding the optimal balance requires burdensome tuning of the loss-balance hyper-parameters. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to reach its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings. We provide an explanation for the results in the discussion.
Related papers
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021)
- AutoCycle-VC: Towards Bottleneck-independent Zero-shot Cross-lingual Voice Conversion (2023)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)
- End-to-end Zero-shot Voice Conversion With Location-variable Convolutions (2022)
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024)
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)