Efficiently Trainable Text-to-speech System Based On Deep Convolutional Networks With Guided Attention
2017 · Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara
Abstract
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without any recurrent units. Recurrent neural networks (RNNs) have recently become a standard technique for modeling sequential data, and they have been used in some cutting-edge neural TTS techniques. However, training RNN components often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to show that an alternative neural TTS system based only on CNNs can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS was sufficiently trained overnight (15 hours) using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
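To illustrate the parallelizability argument in the abstract, the following is a minimal NumPy sketch of a causal dilated 1D convolution, the kind of building block that convolutional sequence models typically rely on. It is not the authors' implementation: the function name `causal_conv1d`, the tensor shapes, and the dilation value are assumptions made for illustration only. The point it shows is that the only loop runs over the kernel taps, so every time step is computed at once, whereas a recurrent layer must step through time sequentially.

```python
import numpy as np


def causal_conv1d(x, w, dilation=1):
    """Causal dilated 1D convolution (illustrative sketch, hypothetical helper).

    x: input of shape (T, C_in), one row per time step
    w: kernel of shape (K, C_in, C_out)
    Returns an array of shape (T, C_out) where output[t] depends only on
    x[t], x[t - dilation], ..., x[t - (K - 1) * dilation].
    """
    T, c_in = x.shape
    K, _, c_out = w.shape
    # Left-pad with zeros so that no output position sees future inputs.
    pad = (K - 1) * dilation
    x_padded = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    y = np.zeros((T, c_out))
    # Loop over kernel taps only; all T time steps are handled per tap
    # by a single matrix product.
    for k in range(K):
        start = k * dilation
        y += x_padded[start:start + T] @ w[k]
    return y


# Example with arbitrary sizes: 10 time steps, 4 input and 5 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
w = rng.standard_normal((3, 4, 5))
y = causal_conv1d(x, w, dilation=2)
```

Because each output frame depends only on current and past inputs, perturbing a later input frame leaves all earlier output frames unchanged, which is the property needed for autoregressive synthesis.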
Related papers
- Efficiently Trained Low-resource Mongolian Text-to-speech System Based On FullConv-TTS (2022)
- EM-TTS: Efficiently Trained Low-resource Mongolian Lightweight Text-to-speech (2024)
- Deep Voice 3: Scaling Text-to-speech With Convolutional Sequence Learning (2017)
- Using Deep Learning Techniques And Inferential Speech Statistics For AI Synthesised Speech Recognition (2021)
- Fast And High-quality Singing Voice Synthesis System Based On Convolutional Neural Networks (2019)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)
- DeviceTTS: A Small-footprint, Fast, Stable Network For On-device Text-to-speech (2020)