High Quality Streaming Speech Synthesis With Low, Sentence-length-independent Latency
2021 · Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, et al.
Abstract
This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for generating sentences of arbitrary length. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency that is independent of sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real-time on a computer CPU and 6.5 times faster on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-
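The streaming idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names, the chunk size, and the 10 ms frame duration are assumptions chosen for illustration, and only the 31x and 6.5x speed-ups come from the abstract. The sketch shows why the time to the first chunk depends on the chunk size and generation speed, not on the sentence length.

```python
from typing import Iterator, List

FRAME_MS = 10.0  # assumed duration of one acoustic frame, in milliseconds


def stream_frames(total_frames: int, chunk_size: int) -> Iterator[List[int]]:
    """Yield frame indices in fixed-size chunks.

    Stand-in for an unrolled decoder that hands each chunk to the
    vocoder as soon as it is ready instead of waiting for the sentence.
    """
    for start in range(0, total_frames, chunk_size):
        yield list(range(start, min(start + chunk_size, total_frames)))


def first_chunk_latency_ms(chunk_size: int, speedup: float,
                           frame_ms: float = FRAME_MS) -> float:
    """Time to produce the first chunk when the model runs `speedup`
    times faster than real time. Sentence length is not an input."""
    return chunk_size * frame_ms / speedup


if __name__ == "__main__":
    chunk = 20  # hypothetical chunk size, in frames
    for name, speedup in (("computer CPU", 31.0), ("mobile CPU", 6.5)):
        latency = first_chunk_latency_ms(chunk, speedup)
        print(f"{name}: first chunk after about {latency:.1f} ms")
    # A short and a long sentence start with the same first chunk.
    short = next(stream_frames(50, chunk))
    long_ = next(stream_frames(5000, chunk))
    print(short == long_)
```

This estimate covers only acoustic feature generation under the stated assumptions; vocoder time and any text front-end cost would add to the latency of a full system.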
Related papers
- Controllable Sequence-to-sequence Neural TTS With LPCNET Backend For Real-time Speech Synthesis On CPU (2020)
- High Quality, Lightweight And Adaptable TTS Using LPCNet (2019)
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024)
- Multi-rate Attention Architecture For Fast Streamable Text-to-speech Spectrum Modeling (2021)
- LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (2018)
- A Real-time Wideband Neural Vocoder At 1.6 kb/s Using LPCNet (2019)
- Neural Speech Synthesis On A Shoestring: Improving The Efficiency Of LPCNet (2022)
- Tacotron: Towards End-to-end Speech Synthesis (2017)