FLY-TTS: Fast, Lightweight And High-quality End-to-end Text-to-speech Synthesis
2024 · Yinlin Guo, Yening Lv, Jinqiao Dou, et al.
Abstract
While recent advances in text-to-speech (TTS) synthesis have yielded remarkable improvements in speech quality, research on lightweight and fast models remains limited. This paper introduces FLY-TTS, a fast, lightweight, and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, which are then passed through the inverse short-time Fourier transform (iSTFT) to synthesize waveforms; 2) to compress the model, we introduce a grouped parameter-sharing mechanism in the text encoder and the flow-based model; 3) we further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS achieves speech quality comparable to the strong baseline.
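The speed figures in the abstract can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio), which the abstract does not spell out. The helper name `real_time_factor` and the 10-second utterance are hypothetical and used only for illustration.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# RTF values reported in the abstract (Intel Core i9 CPU).
BASELINE_RTF = 0.1221
FLY_TTS_RTF = 0.0139

# Ratio of the two RTFs gives the reported speedup.
speedup = BASELINE_RTF / FLY_TTS_RTF

# Hypothetical example: compute time implied for a 10-second utterance.
seconds_per_10s_audio = FLY_TTS_RTF * 10.0

print(f"speedup: {speedup:.1f}x")
print(f"time to synthesize 10 s of audio: {seconds_per_10s_audio:.3f} s")
```

Under this definition, an RTF below 1 means synthesis runs faster than playback, and the ratio of the two reported values rounds to the 8.8x speedup stated in the abstract.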
Related papers
- Lightweight And High-fidelity End-to-end Text-to-speech With Multi-band Generation And Inverse Short-time Fourier Transform (2022) 14.57
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) 0.00
- FastSpeech 2: Fast And High-quality End-to-end Text To Speech (2020) 0.00
- High Quality, Lightweight And Adaptable TTS Using LPCNet (2019) 10.97
- DeviceTTS: A Small-footprint, Fast, Stable Network For On-device Text-to-speech (2020) 0.00
- Fast Griffin Lim Based Waveform Generation Strategy For Text-to-speech Synthesis (2020) 7.16
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020) 0.00
- F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching (2024) 0.00