Location, Location: Enhancing The Evaluation Of Text-to-speech Synthesis Using The Rapid Prosody Transcription Paradigm
2021 · Elijah Gutierrez, Pilar Oplustil-Gallegos, Catherine Lai
Abstract
Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real-time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook te
Related papers
- Objective Evaluation Of Prosody And Intelligibility In Speech Synthesis Via Conditional Prediction Of Discrete Tokens (2025) · 0.00
- A Text-to-speech Pipeline, Evaluation Methodology, And Initial Fine-tuning Results For Child Speech Synthesis (2022) · 10.21
- Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features (2022) · 6.34
- Automos: Learning A Non-intrusive Assessor Of Naturalness-of-speech (2016) · 0.00
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024) · 0.00
- Speech Is More Than Words: Do Speech-to-text Translation Systems Leverage Prosody? (2024) · 2.26
- Stable-tts: Stable Speaker-adaptive Text-to-speech Synthesis Via Prosody Prompting (2024) · 4.52
- Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts (2022) · 7.81