VQTTS: High-fidelity Text-to-speech Synthesis With Self-supervised VQ Acoustic Feature
2022 · Chenpeng Du, Yiwei Guo, Xie Chen, et al.
Abstract
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it very difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT mel-spectrogram and the one predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM, txt2vec, and a vocoder, vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than the mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec basically becomes a classification model instead of a tradit
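The abstract describes txt2vec as a classification model over VQ acoustic features. A minimal sketch of what such a per-frame classification objective can look like is given below. It is an illustration only, not the authors' implementation: the function names, the codebook size of 4, and the toy logits are all hypothetical, and the sketch covers only the idea of scoring each frame against a set of codebook indices.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over one frame's codebook scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def frame_cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and target VQ indices.

    logits:  list of frames, each a list of K scores (one per codebook entry)
    targets: list of integer codebook indices, one per frame
    """
    total = 0.0
    for frame_scores, target in zip(logits, targets):
        total -= log_softmax(frame_scores)[target]
    return total / len(targets)

def predict_indices(logits):
    """Pick the highest-scoring codebook index for every frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

# Hypothetical toy data: 3 frames, codebook of K = 4 entries.
logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
targets = [0, 1, 3]

loss = frame_cross_entropy(logits, targets)
predicted = predict_indices(logits)
```

In this framing the model is trained to choose one discrete index per frame, in contrast to a model that outputs continuous mel-spectrogram values directly.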
Related papers
- QS-TTS: Towards Semi-supervised Text-to-speech Synthesis Via Vector-quantized Self-supervised Speech Representation Learning (2023) 2.26
- DelightfulTTS 2: End-to-end Speech Synthesis With Adversarial Vector-quantized Auto-encoders (2022) 9.23
- DQR-TTS: Semi-supervised Text-to-speech Synthesis With Dynamic Quantized Representation (2023) 2.26
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (2020) 9.59
- An Experimental Study: Assessing The Combined Framework Of WavLM And BEST-RQ For Text-to-speech Synthesis (2023) 0.00
- vq-wav2vec: Self-supervised Learning Of Discrete Speech Representations (2019) 0.00
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022) 5.84
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020) 12.54