PortaSpeech: Portable And High-quality Generative Text-to-speech
2021 · Yi Ren, Jinglin Liu, Zhou Zhao
Abstract
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture. 2) To further c
Related papers
- Glow-TTS: A Generative Flow For Text-to-speech Via Monotonic Alignment Search (2020) · 0.00
- FoundationTTS: Text-to-speech For ASR Customization With Generative Language Model (2023) · 0.00
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) · 0.00
- Non-autoregressive Neural Text-to-speech (2019) · 0.00
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020) · 12.54
- WaveGlow: A Flow-based Generative Network For Speech Synthesis (2018) · 20.65
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021) · 0.00
- Flowtron: An Autoregressive Flow-based Generative Network For Text-to-speech Synthesis (2020) · 5.91