MoboAligner: A Neural Alignment Model For Non-autoregressive TTS With Monotonic Boundary Search
2020 · Naihan Li, Shujie Liu, Yanqing Liu, et al.
Abstract
To speed up inference in neural speech synthesis, non-autoregressive models have received increasing attention. Non-autoregressive models require the durations of text tokens as additional input in order to make a hard alignment between the encoder and the decoder. This duration-based alignment plays a crucial role: it controls the correspondence between text tokens and spectrum frames, and it determines the rhythm and speed of the synthesized audio. To obtain better duration-based alignment and improve the quality of non-autoregressive speech synthesis, we propose a novel neural alignment model named MoboAligner. Given pairs of text and mel spectrum, MoboAligner identifies the boundaries of text tokens in the mel spectrum frames, based on token-frame similarity in the neural semantic space, within an end-to-end framework. From these boundaries, durations can be extracted and used to train non-autoregressive TTS models. Compared with the duration extrac
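The relationship between token boundaries and durations described in the abstract can be illustrated with a small sketch. This is not MoboAligner itself, which learns the token-frame similarity and the boundaries with a neural model. It is a generic dynamic-programming segmentation that assumes a similarity matrix is already given. The function names `monotonic_boundaries` and `boundaries_to_durations` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def monotonic_boundaries(sim):
    """Find monotonic token boundaries from a (tokens x frames) similarity matrix.

    Assumption for illustration: each token covers a contiguous run of at
    least one frame, and we maximize the summed similarity of the assignment.
    Returns the exclusive end-frame index of every token.
    """
    T, F = sim.shape
    assert F >= T, "need at least one frame per token"
    # cum[t, f] = sum of sim[t, :f], so a span score is a difference of two entries
    cum = np.concatenate([np.zeros((T, 1)), np.cumsum(sim, axis=1)], axis=1)
    # best[t, f]: best score when tokens 0..t exactly cover frames 0..f-1
    best = np.full((T, F + 1), -np.inf)
    back = np.zeros((T, F + 1), dtype=int)
    best[0, 1:] = cum[0, 1:]
    for t in range(1, T):
        for f in range(t + 1, F + 1):
            starts = np.arange(t, f)  # candidate end frames of token t-1
            cand = best[t - 1, starts] + cum[t, f] - cum[t, starts]
            k = int(np.argmax(cand))
            best[t, f] = cand[k]
            back[t, f] = starts[k]
    # Backtrack from the last token, which must end at the final frame
    ends = np.zeros(T, dtype=int)
    ends[T - 1] = F
    for t in range(T - 1, 0, -1):
        ends[t - 1] = back[t, ends[t]]
    return ends

def boundaries_to_durations(ends):
    """Durations are the differences between consecutive boundaries."""
    return np.diff(np.concatenate([[0], ends]))

# Toy similarity: 3 tokens, 6 frames, with a clear block structure
sim = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
], dtype=float)
ends = monotonic_boundaries(sim)
durations = boundaries_to_durations(ends)
print(ends, durations)  # the durations sum to the number of frames
```

Durations of this kind are what a non-autoregressive TTS model consumes during training: each token's encoder output is repeated for as many frames as its duration.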
Related papers
- Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021)
- Aligner-guided Training Paradigm: Advancing Text-to-speech Models With Aligner Guided Duration (2024)
- AlignTTS: Efficient Feed-forward Text-to-speech System Without Explicit Alignment (2020)
- Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments (2019)
- MELA-TTS: Joint Transformer-diffusion Model With Representation Alignment For Speech Synthesis (2025)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)