Evaluating And Reducing The Distance Between Synthetic And Real Speech Distributions
2022 · Christoph Minixhofer, Ondřej Klejch, Peter Bell
Abstract
While modern Text-to-Speech (TTS) systems can produce natural-sounding speech, they remain unable to reproduce the full diversity found in natural speech data. We consider the distribution of all possible real speech samples that could be generated by a given set of speakers alongside the distribution of all synthetic samples that could be generated for the same set of speakers, using a particular TTS system. We set out to quantify the distance between real and synthetic speech via a range of utterance-level statistics related to properties of the speaker, speech prosody and acoustic environment. Differences in the distribution of these statistics are evaluated using the Wasserstein distance. We reduce these distances by providing ground-truth values at generation time, and quantify the improvements to the overall distribution distance, approximated using an automatic speech recognition system. Our best system achieves a 10% reduction in distribution distance.
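As a rough illustration of the comparison described in the abstract, the sketch below computes the one-dimensional Wasserstein distance between two sets of utterance-level statistics by integrating the absolute difference of their empirical CDFs. The function name `wasserstein_1d` and the pitch values are hypothetical and chosen only for illustration; this is not the authors' implementation, and the paper's actual statistics and tooling may differ.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1D distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs of the two samples.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    # The CDFs only change at observed values, so integrate piecewise
    # between consecutive distinct points of the pooled sample.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Hypothetical per-utterance mean pitch values in Hz (illustrative only,
# not taken from the paper).
real_pitch = [110.0, 180.0, 205.0, 240.0]
synthetic_pitch = [150.0, 170.0, 185.0, 195.0]

distance = wasserstein_1d(real_pitch, synthetic_pitch)
print(f"Wasserstein distance for mean pitch: {distance:.2f} Hz")
```

In this toy data the synthetic values cluster more tightly than the real ones, which is the kind of reduced diversity the distance is meant to expose. Where SciPy is available, `scipy.stats.wasserstein_distance` computes the same quantity for one-dimensional samples.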
Related papers
- Effect Of Data Reduction On Sequence-to-sequence Neural TTS (2018) · 9.76
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019) · 12.68
- Beyond Oversmoothing: Evaluating DDPM And MSE For Scalable Speech Synthesis In ASR (2024) · 0.00
- Synthetic Speech Detection Based On Temporal Consistency And Distribution Of Speaker Features (2023) · 0.00
- On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition (2023) · 2.26
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024) · 0.00
- Comparing The Benefit Of Synthetic Training Data For Various Automatic Speech Recognition Architectures (2021) · 5.24
- TTS-by-TTS 2: Data-selective Augmentation For Neural Speech Synthesis Using Ranking Support Vector Machine With Variational Autoencoder (2022) · 4.52