Aligner-guided Training Paradigm: Advancing Text-to-speech Models With Aligner Guided Duration
2024 · Haowei Lou, Helen Paik, Wen Hu, et al.
Abstract
Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on durations generated by external tools like the Montreal Forced Aligner, which can be time-consuming and inflexible. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and
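For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In a TTS evaluation, the hypothesis would typically be the output of a speech recognizer run on the synthesized audio. The abstract as shown here does not say whether the 16% figure is a relative or an absolute change in WER.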
Related papers
- AlignTTS: Efficient Feed-forward Text-to-speech System Without Explicit Alignment (2020)
- Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021)
- End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE (2020)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- MoBoAligner: A Neural Alignment Model For Non-autoregressive TTS With Monotonic Boundary Search (2020)
- On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition (2023)
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)