EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-to-speech Models
2022 · Perry Lam, Huayun Zhang, Nancy F. Chen, et al.
Abstract
Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although many sparsity methods have been proposed for other domains, they have rarely been applied to TTS. In this work, we ask: how do selected sparsity techniques affect performance and model complexity? We compare a Tacotron2 baseline with the results of applying five such techniques. We evaluate performance in terms of naturalness, intelligibility and prosody, and also report model size and training time. Complementing prior research, we find that pruning before or during training can achieve performance similar to pruning after training while training much faster, and that removing entire neurons degrades performance much more than removing individual parameters. To the best of our knowledge, this is the first work that compares sparsity paradigms in text-to-speech synthesis.
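The abstract contrasts removing individual parameters with removing entire neurons. The toy sketch below illustrates that distinction on a small weight matrix and is not the authors' method. The abstract does not name the five techniques, so the magnitude and L1-norm criteria used here are assumptions, as are the function names `prune_parameters`, `prune_neurons` and `sparsity_of` and the convention that each row is one neuron.

```python
# Illustrative sketch only: two ways to reach the same sparsity level.
# Rows are treated as neurons; the selection criteria are assumptions.

def prune_parameters(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude individual weights."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # Cap removals at k so ties at the threshold do not over-prune.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


def prune_neurons(weights, sparsity):
    """Structured pruning: zero whole rows (neurons) with the smallest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in weights]
    k = int(len(weights) * sparsity)  # number of neurons to remove
    drop = set(sorted(range(len(weights)), key=lambda i: norms[i])[:k])
    return [[0.0] * len(row) if i in drop else row[:]
            for i, row in enumerate(weights)]


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


toy = [
    [0.9, -0.1, 0.4, 0.05],
    [0.2, -0.3, 0.1, 0.6],
    [-0.7, 0.8, -0.02, 0.5],
    [0.03, 0.3, -0.2, 0.15],
]

unstructured = prune_parameters(toy, 0.5)
structured = prune_neurons(toy, 0.5)
print(sparsity_of(unstructured), sparsity_of(structured))
```

Both calls zero half of the entries, but the structured variant also discards large weights such as the 0.6 in the second row because its whole neuron is removed. That is one plausible intuition for why neuron-level removal could hurt more at equal sparsity, though the abstract itself gives no mechanism.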