DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
2022 · Yanqing Liu, Ruiqing Xue, Lei He, et al.
Abstract
Current text-to-speech (TTS) systems usually use a cascaded pipeline of an acoustic model and a vocoder, with mel-spectrograms as the intermediate representation. This design has two limitations: 1) the acoustic model and vocoder are trained separately instead of being jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal. To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct the speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS)
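The abstract does not give the details of the codec, so the following is only a minimal sketch of the generic vector-quantization step that VQ auto-encoders rely on: each frame-level vector is replaced by its nearest codebook entry. The function name `quantize_frames`, the toy codebook, and the use of squared Euclidean distance are all assumptions made for illustration, not the paper's implementation.

```python
from typing import List, Sequence, Tuple


def quantize_frames(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame vector to its nearest codebook entry.

    Hypothetical helper: uses squared Euclidean distance, which is a
    common choice for VQ but is an assumption here.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Distance from this frame to every codebook vector.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # Index of the closest codebook vector.
        best = min(range(len(codebook)), key=distances.__getitem__)
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy example: three 2-dimensional "frames" and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, quantized = quantize_frames(frames, codebook)
print(indices)
```

In a trained system the codebook would be learned jointly with the encoder and decoder; here it is fixed by hand so that the lookup itself is easy to follow.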
Related papers
- VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding (2021) · 5.84
- QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning (2023) · 2.26
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (2021) · 0.00
- VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature (2022) · 11.85
- Expediting TTS Synthesis with Adversarial Vocoding (2019) · 6.77
- End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks (2021) · 11.58
- DQR-TTS: Semi-Supervised Text-to-Speech Synthesis with Dynamic Quantized Representation (2023) · 2.26
- Efficient Non-Autoregressive GAN Voice Conversion Using VQWav2vec Features and Dynamic Convolution (2022) · 0.00