QS-TTS: Towards Semi-supervised Text-to-speech Synthesis Via Vector-quantized Self-supervised Speech Representation Learning
2023 · Haohan Guo, Fenglong Xie, Jiawen Kang, et al.
Abstract
This paper proposes QS-TTS, a novel semi-supervised TTS framework that improves TTS quality while requiring less supervised data, using Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) to exploit additional unlabeled speech audio. The framework comprises two VQ-S3R learners. First, the principal learner provides a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R via MSMC-VQ-GAN combined with contrastive S3RL, and decodes it back to high-quality audio. Then, the associate learner further abstracts the MSMC representation into a highly compact VQ representation through a VQ-VAE. These two generative VQ-S3R learners provide useful speech representations and pre-trained models for TTS, significantly improving synthesis quality with a lower requirement for supervised data. QS-TTS is evaluated comprehensively under various scenarios via subjective and objective tests. The results powerfully demonstrate the superior performance of
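To make the "multi-codebook" idea in the abstract concrete, here is a minimal, generic sketch of multi-codebook vector quantization: a feature vector is split into equal chunks, and each chunk is replaced by its nearest entry in a separate codebook. This is an illustration only, not the paper's MSMC-VQ-GAN. The function names, the toy codebooks, and the choice of squared Euclidean distance are all assumptions made for the example, and the multi-stage (multi-resolution) part is not shown.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def multi_codebook_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Split vec into len(codebooks) equal chunks and quantize each chunk
    with its own codebook (hypothetical helper, for illustration)."""
    n_books = len(codebooks)
    if n_books == 0 or len(vec) % n_books != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    chunk = len(vec) // n_books

    indices: List[int] = []
    quantized: List[float] = []
    for h, codebook in enumerate(codebooks):
        part = vec[h * chunk:(h + 1) * chunk]
        idx = nearest_code(part, codebook)
        indices.append(idx)               # discrete token for this codebook
        quantized.extend(codebook[idx])   # chunk replaced by its codeword
    return indices, quantized


# Toy data (assumed): two codebooks, each holding two 2-dimensional codewords.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]
frame = [0.9, 0.8, 0.1, 0.7]
indices, quantized = multi_codebook_quantize(frame, codebooks)
print(indices, quantized)
```

Each frame is thus represented by one index per codebook, which is what makes the representation discrete and compact; a trained system would learn the codebooks rather than fix them by hand.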
Related papers
- DQR-TTS: Semi-supervised Text-to-speech Synthesis With Dynamic Quantized Representation (2023)
- VQTTS: High-fidelity Text-to-speech Synthesis With Self-supervised VQ Acoustic Feature (2022)
- DelightfulTTS 2: End-to-end Speech Synthesis With Adversarial Vector-quantized Auto-encoders (2022)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (2020)
- Towards High-quality Neural TTS For Low-resource Languages By Learning Compact Speech Representations (2022)
- An Experimental Study: Assessing The Combined Framework Of WavLM And BEST-RQ For Text-to-speech Synthesis (2023)
- Towards Unsupervised Speech Recognition And Synthesis With Quantized Speech Representation Learning (2019)
- Preliminary Study On Using Vector Quantization Latent Spaces For TTS/VC Systems With Consistent Performance (2021)