A Unified Framework For Collecting Text-to-speech Synthesis Datasets For 22 Indian Languages
2024 · Sujitha Sathiyamoorthy, N Mohana, Anusha Prakash, et al.
Abstract
The performance of a text-to-speech (TTS) synthesis model depends on various factors, of which the quality of the training data is of utmost importance. Large volumes of data are collected around the globe for various languages, but resources for Indian languages are few. Although many data collection efforts exist, a common set of protocols for data collection is necessary for building TTS systems in Indian languages, primarily because TTS systems need to be developed uniformly across languages. In this paper, we present our learnings from data collection efforts for Indic languages over 15 years. These databases have been used in unit selection synthesis, hidden Markov model based, and end-to-end frameworks, and for generating prosodically rich TTS systems. The most significant feature of the data collected is its purity, which enables building high-quality TTS systems with a smaller dataset than is used for European/Chinese languages.
Related papers
- Towards Building Text-to-speech Systems For The Next Billion Users (2022)
- Generic Indic Text-to-speech Synthesisers With Rapid Adaptation In An End-to-end Framework (2020)
- Towards Developing State-of-the-art TTS Synthesisers For 13 Indian Languages With Signal Processing Aided Alignments (2022)
- Indicvoices-r: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024)
- Enhancing Out-of-vocabulary Performance Of Indian TTS Systems For Practical Applications Through Low-effort Data Strategies (2024)
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023)
- Extending Multilingual Speech Synthesis To 100+ Languages Without Transcribed Data (2024)
- An Automated End-to-end Open-source Software For High-quality Text-to-speech Dataset Generation (2024)