EM-TTS: Efficiently Trained Low-resource Mongolian Lightweight Text-to-speech
2024 · Ziqi Liang, Haoxiang Shi, Jiawei Wang, et al.
Abstract
Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis. Recurrent neural networks (RNNs) have become a standard, widely used modeling technique for sequential data in TTS systems. However, training a TTS model that includes RNN components requires powerful GPUs and takes a long time. In contrast, thanks to their high parallelism, CNN-based sequence synthesis techniques can significantly reduce the parameter count and training time of a TTS model while maintaining a certain level of performance, which alleviates the economic cost of training. In this paper, we propose a lightweight TTS system based on deep convolutional neural networks: an end-to-end TTS model trained in two stages that does not employ any recurrent units. Our model consists of two stages: Text2Spectrum and SSRN. The former encodes phonemes into a coarse mel spectrogram, and the latter synthesizes the complete spectrum from the coarse mel spectr
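The two-stage, recurrence-free data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`conv1d`, `make_conv`, `text2spectrum`, `ssrn`), the layer sizes, the random weights, and the time-upsampling factor of 4 are all assumptions chosen for illustration. In particular, the toy first stage emits one coarse frame per phoneme, whereas a real system needs some mechanism to align text with audio frames. The point is only that every output frame depends on a fixed window of inputs, never on previously computed outputs, so all frames could be computed in parallel.

```python
import random

def conv1d(x, w, b):
    """'Same'-padded 1D convolution over time (odd kernel sizes).
    x: list of T frames, each a list of C_in floats.
    w: kernel indexed as w[k][c_in][c_out]; b: list of C_out biases."""
    k, c_in = len(w), len(w[0])
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for t in range(len(x)):
        # Each frame reads only a fixed input window: no recurrent state.
        frame = list(b)
        for i in range(k):
            for ci in range(c_in):
                v = padded[t + i][ci]
                for co in range(len(b)):
                    frame[co] += v * w[i][ci][co]
        out.append(frame)
    return out

def make_conv(rng, k, c_in, c_out):
    """Random (untrained) weights, purely for shape illustration."""
    w = [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
          for _ in range(c_in)] for _ in range(k)]
    return w, [0.0] * c_out

def text2spectrum(phoneme_ids, embedding, conv):
    """Toy stage 1: phoneme ids -> coarse mel frames.
    Assumption: one frame per phoneme, with no text-audio alignment."""
    x = [embedding[p] for p in phoneme_ids]
    return conv1d(x, *conv)

def ssrn(coarse_mel, conv, upsample=4):
    """Toy stage 2: coarse mel -> full spectrum.
    Assumption: frames are repeated in time, then mapped to more bins."""
    stretched = [list(f) for f in coarse_mel for _ in range(upsample)]
    return conv1d(stretched, *conv)

rng = random.Random(0)
N_PHONEMES, EMB, N_MELS, N_BINS = 40, 8, 16, 32  # hypothetical sizes
embedding = [[rng.uniform(-1, 1) for _ in range(EMB)]
             for _ in range(N_PHONEMES)]
t2s_conv = make_conv(rng, 3, EMB, N_MELS)
ssrn_conv = make_conv(rng, 3, N_MELS, N_BINS)

phonemes = [3, 17, 5, 22, 9]  # hypothetical phoneme ids
coarse = text2spectrum(phonemes, embedding, t2s_conv)
full = ssrn(coarse, ssrn_conv)
print(len(coarse), len(coarse[0]))  # 5 16
print(len(full), len(full[0]))      # 20 32
```

With five input phonemes, the toy first stage yields 5 frames of 16 coarse mel channels, and the toy second stage yields 20 frames of 32 spectrum bins.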
Related papers
- Efficiently Trained Low-resource Mongolian Text-to-speech System Based On FullConv-TTS (2022) · 0.00
- Efficiently Trainable Text-to-speech System Based On Deep Convolutional Networks With Guided Attention (2017) · 16.41
- Low-resource Mongolian Speech Synthesis Based On Automatic Prosody Annotation (2022) · 0.00
- Towards High-quality Neural TTS For Low-resource Languages By Learning Compact Speech Representations (2022) · 0.00
- MnTTS2: An Open-source Multi-speaker Mongolian Text-to-speech Synthesis Dataset (2022) · 5.81
- MnTTS: An Open-source Mongolian Text-to-speech Synthesis Dataset And Accompanied Baseline (2022) · 5.24
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) · 0.00
- Neural Speech Synthesis With Transformer Network (2018) · 19.95