FireRedTTS: A Foundation Text-to-speech Framework For Industry-level Generative Speech Applications
2024 · Hao-Han Guo, Yao Hu, Kun Liu, et al.
Abstract
This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and can be generated by a language model from the prompt text and audio. Then, a two-stage waveform generator is proposed to decode them to the high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learnin
Related papers
- FoundationTTS: Text-to-speech For ASR Customization With Generative Language Model (2023)
- An Automated End-to-end Open-source Software For High-quality Text-to-speech Dataset Generation (2024)
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)
- Empowering Global Voices: A Data-efficient, Phoneme-tone Adaptive Approach To High-fidelity Speech Synthesis (2025)
- Fish-Speech: Leveraging Large Language Models For Advanced Multilingual Text-to-speech Synthesis (2024)
- BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data (2024)
- High Fidelity Speech Synthesis With Adversarial Networks (2019)
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024)