Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech
2021 · Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee
Abstract
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. As complexity increases and efficiency decreases with every additional step, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement-learning-based duration search method. Our proposed generator is feed-forward, and the aligner trains an agent to make optimal duration predictions by receiving active feedback from the actions it takes to maximize cumulative reward. We demonstrate that accurate phoneme-to-frame alignments generated by trained agents enhance the fidelity and naturalness of synthesized audio. Experimental results also show the superiority of our proposed model compared to ot
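To make the term "phoneme-to-frame alignment" concrete, the following is a minimal sketch of how per-phoneme durations define a hard monotonic alignment: each phoneme is repeated for as many frames as its duration. This is a generic illustration, not the paper's implementation; the function name `expand_by_duration`, the phoneme symbols, and the duration values are all hypothetical, and the reinforcement learning search that would produce the durations is not shown.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its number of frames.

    Hypothetical helper for illustration only: the result is the
    frame-level phoneme sequence implied by a set of durations.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


# Assumed example values, not taken from the paper.
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 1, 4]

frames = expand_by_duration(phonemes, durations)
print(len(frames))   # 10, the sum of the durations
print(frames[:5])    # ['HH', 'HH', 'AH', 'AH', 'AH']
```

Under this view, a duration predictor only has to output one integer per phoneme, and the total number of frames is the sum of those integers.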
Related papers
- Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments (2019)
- End-to-end Adversarial Text-to-speech (2020)
- Aligner-guided Training Paradigm: Advancing Text-to-speech Models With Aligner Guided Duration (2024)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)
- Towards Developing State-of-the-art TTS Synthesisers For 13 Indian Languages With Signal Processing Aided Alignments (2022)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- MoboAligner: A Neural Alignment Model For Non-autoregressive TTS With Monotonic Boundary Search (2020)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)