Objective Evaluation Of Prosody And Intelligibility In Speech Synthesis Via Conditional Prediction Of Discrete Tokens
2025 · Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, et al.
Abstract
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experimen