AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment
2020 · Zhen Zeng, Jianzong Wang, Ning Cheng, et al.
Abstract
Targeting both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer that generates the mel-spectrum from a sequence of characters, with the duration of each character determined by a duration predictor. Instead of adopting the attention mechanism of Transformer TTS to align text to the mel-spectrum, we introduce an alignment loss that considers all possible alignments during training by means of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also high efficiency, running more than 50 times faster than real time.
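The abstract does not spell out the recursion behind the alignment loss, so the following is only a minimal sketch of how a loss can sum over all monotonic alignments with dynamic programming. It assumes a generic forward algorithm in the log domain: `log_prob[t][s]` is a hypothetical table holding the log-probability that mel frame `t` is produced by character `s`, every character must cover at least one frame, and the alignment starts at the first character and ends at the last. The function name `alignment_loss` and these constraints are illustrative assumptions, not the authors' implementation.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def alignment_loss(log_prob):
    """Negative log-likelihood summed over all monotonic alignments.

    log_prob[t][s]: assumed log-probability of mel frame t given character s.
    Each frame either stays on the current character or advances to the next,
    so the number of frames T must be at least the number of characters S.
    """
    num_frames = len(log_prob)
    num_chars = len(log_prob[0])
    if num_frames < num_chars:
        raise ValueError("need at least one frame per character")

    # alpha[s]: log-probability of all partial alignments that end with the
    # current frame assigned to character s.
    alpha = [NEG_INF] * num_chars
    alpha[0] = log_prob[0][0]  # the first frame must belong to the first character

    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * num_chars
        for s in range(num_chars):
            stay = alpha[s]                                # same character as frame t-1
            advance = alpha[s - 1] if s > 0 else NEG_INF   # move on to the next character
            total = log_add(stay, advance)
            if total != NEG_INF:
                new_alpha[s] = total + log_prob[t][s]
        alpha = new_alpha

    # The last frame must belong to the last character.
    return -alpha[num_chars - 1]
```

The table is filled in O(T x S) time even though the number of alignments grows combinatorially, which is what makes summing over every alignment practical. In a real system the log-probabilities would come from the model and the recursion would be written in a differentiable framework so gradients can flow through it.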
Related papers
- MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis (2025)
- EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020)
- FastSpeech: Fast, Robust and Controllable Text to Speech (2019)
- Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration (2024)
- MoboAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search (2020)
- Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech (2021)
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (2020)
- Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings (2021)