FPETS: Fully Parallel End-to-end Text-to-speech System
2018 · Dabiao Ma, Zhiba Su, Wenxuan Wang, et al.
Abstract
End-to-end text-to-speech (TTS) systems can greatly improve the quality of synthesised speech, but they usually suffer from high latency due to their autoregressive structure. The synthesised speech may also exhibit error modes such as repeated words, mispronunciations, and skipped words. In this paper, we propose a novel non-autoregressive, fully parallel end-to-end TTS system (FPETS). It uses a new alignment model and the recently proposed U-shaped convolutional structure, UFANS. Unlike an RNN, UFANS can capture long-term information in a fully parallel manner. A trainable position encoding and a two-step training strategy are used to learn better alignments. Experimental results show that FPETS exploits parallel computation and achieves a significant inference speed-up over state-of-the-art end-to-end TTS systems. More specifically, FPETS is 600X faster than Tacotron2, 50X faster than DCTTS and 10X faster than Deep Voice3. And FPETS can genera
Related papers
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- JETS: Jointly Training FastSpeech2 And HiFi-GAN For End To End Text To Speech (2022)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)
- FastSpeech 2: Fast And High-quality End-to-end Text To Speech (2020)
- UFANS: U-shaped Fully-parallel Acoustic Neural Structure For Statistical Parametric Speech Synthesis With 20X Faster (2018)
- FLY-TTS: Fast, Lightweight And High-quality End-to-end Text-to-speech Synthesis (2024)
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020)
- ESPnet-TTS: Unified, Reproducible, And Integratable Open Source End-to-end Text-to-speech Toolkit (2019)