FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model
2023 · Ruiqing Xue, Yanqing Liu, Lei He, et al.
Abstract
Neural text-to-speech (TTS) generally uses either a cascaded architecture with a separately optimized acoustic model and vocoder, or an end-to-end architecture with continuous mel-spectrograms or self-extracted speech frames as the intermediate representation bridging the acoustic model and vocoder. Both suffer from two limitations: 1) continuous acoustic frames are hard to predict from phonemes alone, so acoustic information such as duration or pitch is also needed to solve the one-to-many problem, which does not scale easily to large and noisy datasets; 2) achieving diverse speech output from continuous speech features usually requires complex VAE or flow-based models. In this paper, we propose FoundationTTS, a new speech synthesis system with a neural audio codec for discrete speech token extraction and waveform reconstruction, and a large language model for discrete token generation from linguistic (phoneme) tokens. Specifically, 1) we propose a hierarchical codec network ba
Related papers
- Hierarchical Generative Modeling For Controllable Speech Synthesis (2018)
- FireRedTTS: A Foundation Text-to-speech Framework For Industry-level Generative Speech Applications (2024)
- High Fidelity Speech Synthesis With Adversarial Networks (2019)
- BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data (2024)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- PortaSpeech: Portable And High-quality Generative Text-to-speech (2021)
- Waveform Generation For Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks (2018)
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)