UniCATS: A Unified Context-aware Text-to-speech Framework With Contextual VQ-diffusion And Vocoding
2023 · Chenpeng Du, Yiwei Guo, Feiyu Shen, et al.
Abstract
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2ve
Related papers
- Unisyn: An End-to-end Unified Model For Text-to-speech And Singing Voice Synthesis (2022)
- Unifyspeech: A Unified Framework For Zero-shot Text-to-speech And Voice Conversion (2023)
- Ecat: An End-to-end Model For Multi-speaker TTS & Many-to-many Fine-grained Prosody Transfer (2023)
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022)
- Cross-utterance Conditioned VAE For Speech Generation (2023)
- VQTTS: High-fidelity Text-to-speech Synthesis With Self-supervised VQ Acoustic Feature (2022)
- VECL-TTS: Voice Identity And Emotional Style Controllable Cross-lingual Text-to-speech (2024)
- Unispeaker: A Unified Approach For Multimodality-driven Speaker Generation (2025)