DeviceTTS: A Small-footprint, Fast, Stable Network For On-device Text-to-speech
2020 · Zhiying Huang, Hao Li, Ming Lei
Abstract
As the number of smart devices grows, demand for on-device text-to-speech (TTS) is increasing rapidly. In recent years, many prominent end-to-end TTS methods have been proposed, and they have greatly improved the quality of synthesized speech. However, to ensure acceptable speech quality, most TTS systems depend on large and complex neural network models, which are hard to deploy on-device. This paper proposes DeviceTTS, a small-footprint, fast, stable network for on-device TTS. DeviceTTS uses a duration predictor as a bridge between the encoder and decoder to avoid the word skipping and repetition seen in Tacotron. Because model size is a key factor for on-device TTS, DeviceTTS uses the Deep Feedforward Sequential Memory Network (DFSMN) as its basic component. Moreover, to speed up inference, a mix-resolution decoder is proposed to balance inference speed and speech quality. Experiments are conducted with the WORLD and LPCNet vocoders.
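The abstract does not say how the duration predictor connects the encoder to the decoder. A minimal sketch of the general idea, assuming the common "length regulator" approach used by duration-based TTS models (not confirmed as DeviceTTS's exact mechanism): each encoder state is repeated for its predicted number of frames, so every input token is consumed exactly once and in order, which rules out skipping and repetition by construction. The function name `expand_by_duration` and the toy values are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoder state for its predicted number of acoustic frames.

    Illustrative assumption: durations are non-negative integers, one per
    encoder state, as a duration predictor would emit after rounding.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")

    frames: List[List[float]] = []
    for state, n_frames in zip(encoder_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per frame so the decoder input is frame-aligned.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


if __name__ == "__main__":
    # Three toy token encodings (hypothetical values) and their durations.
    states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    decoder_input = expand_by_duration(states, [2, 1, 3])
    for frame in decoder_input:
        print(frame)
```

Because the alignment is fixed by the durations rather than learned by attention at inference time, the decoder input length is simply the sum of the predicted durations.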
Related papers
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- FLY-TTS: Fast, Lightweight And High-quality End-to-end Text-to-speech Synthesis (2024)
- High Quality, Lightweight And Adaptable TTS Using LPCNet (2019)
- Deep Feed-forward Sequential Memory Networks For Speech Synthesis (2018)
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017)
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)
- Voice Filter: Few-shot Text-to-speech Speaker Adaptation Using Voice Conversion As A Post-processing Module (2022)