A Text-to-speech Pipeline, Evaluation Methodology, And Initial Fine-tuning Results For Child Speech Synthesis
2022 · Rishabh Jain, Mariam Yiwere, Dan Bigioi, et al.
Abstract
Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and very little work has been done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of the fine-tuning experiments. Both objective and subjective evaluations were performed, using a pretrained MOSNet for the objective evaluation and a novel subjective framework for mean opinion score (MOS) evaluations. Subjective evaluations achieved MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective eva
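For readers unfamiliar with how a mean opinion score is obtained, the following is a minimal sketch, not the authors' evaluation framework. It assumes listeners rate each sample on the conventional 1 to 5 scale and reports the mean with a normal-approximation 95% confidence interval. The function name and the ratings are hypothetical and are not data from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical naturalness ratings from eight listeners (illustration only).
naturalness = [4, 4, 3, 5, 4, 4, 3, 4]
mos, ci = mean_opinion_score(naturalness)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In a study like the one described, a score of this kind would be computed separately for each rated dimension (intelligibility, naturalness, consistency).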
Related papers
- Improved Child Text-to-speech Synthesis Through Fastpitch-based Transfer Learning (2023) 0.00
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024) 0.00
- Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features (2022) 6.34
- SOMOS: The Samsung Open MOS Dataset For The Evaluation Of Neural Text-to-speech Synthesis (2022) 10.74
- AutoMOS: Learning A Non-intrusive Assessor Of Naturalness-of-speech (2016) 0.00
- Location, Location: Enhancing The Evaluation Of Text-to-speech Synthesis Using The Rapid Prosody Transcription Paradigm (2021) 3.58
- Comparison Of Speech Representations For Automatic Quality Estimation In Multi-speaker Text-to-speech Synthesis (2020) 0.00
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024) 6.34