Towards Developing State-of-the-art TTS Synthesisers For 13 Indian Languages With Signal Processing Aided Alignments
2022 · Anusha Prakash, S Umesh, Hema A Murthy
Abstract
End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
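As a rough illustration of using signal processing cues in tandem with forced alignment, the sketch below moves each forced-alignment boundary to the nearest signal-derived cue when one lies within a small tolerance, then derives phone durations from the refined boundaries. The abstract does not say which cues or which correction rule the authors use, so the nearest-cue rule, the 20 ms tolerance, and the names `snap_boundaries` and `durations` are assumptions made for this example only, not the paper's method.

```python
from bisect import bisect_left


def snap_boundaries(fa_boundaries, cue_times, tolerance=0.02):
    """Snap forced-alignment boundaries (seconds) to nearby signal cues.

    Hypothetical helper: assumes boundaries are strictly increasing and
    spaced further apart than `tolerance`.
    """
    cues = sorted(cue_times)
    refined = []
    for b in fa_boundaries:
        i = bisect_left(cues, b)
        # Only the cue just before and the cue just after b can be nearest.
        candidates = cues[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda c: abs(c - b), default=None)
        use_cue = best is not None and abs(best - b) <= tolerance
        # Keep the original boundary if snapping would break ordering.
        if use_cue and refined and best <= refined[-1]:
            use_cue = False
        refined.append(best if use_cue else b)
    return refined


def durations(boundaries):
    """Durations (seconds) of the segments between consecutive boundaries."""
    return [round(end - start, 6) for start, end in zip(boundaries, boundaries[1:])]


# Made-up numbers, for illustration only.
fa = [0.00, 0.10, 0.25, 0.40]   # boundaries from a forced aligner
cues = [0.11, 0.20, 0.41]       # candidate boundaries from a signal cue
refined = snap_boundaries(fa, cues)
phone_durations = durations(refined)
```

With these made-up inputs, the boundaries at 0.10 s and 0.40 s move to the cues at 0.11 s and 0.41 s, while the other two stay put because no cue falls within the tolerance. Durations computed this way are the kind of target a FastSpeech-style duration predictor is trained on.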
Related papers
- Towards Building Text-to-speech Systems For The Next Billion Users (2022) 0.00
- Generic Indic Text-to-speech Synthesisers With Rapid Adaptation In An End-to-end Framework (2020) 8.82
- A Unified Framework For Collecting Text-to-speech Synthesis Datasets For 22 Indian Languages (2024) 0.00
- Reinforce-aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021) 8.09
- Empowering Global Voices: A Data-efficient, Phoneme-tone Adaptive Approach To High-fidelity Speech Synthesis (2025) 0.00
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023) 0.00
- Exploring An Inter-pausal Unit (IPU) Based Approach For Indic End-to-end TTS Systems (2024) 0.00
- Fast And Small Footprint Hybrid Hmm-hifigan Based System For Speech Synthesis In Indian Languages (2023) 0.00