Measuring Prosody Diversity In Zero-shot TTS: A New Metric, Benchmark, And Exploration
2025 · Yifan Yang, Bing Han, Hui Wang, et al.
Abstract
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization fro
Related papers
- Objective Evaluation Of Prosody And Intelligibility In Speech Synthesis Via Conditional Prediction Of Discrete Tokens (2025)
- MAD Speech: Measures Of Acoustic Diversity Of Speech (2024)
- Location, Location: Enhancing The Evaluation Of Text-to-speech Synthesis Using The Rapid Prosody Transcription Paradigm (2021)
- AudioEval: Automatic Dual-perspective And Multi-dimensional Evaluation Of Text-to-audio Generation (2025)
- Diverse And Expressive Speech Prosody Prediction With Denoising Diffusion Probabilistic Model (2023)
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024)
- DMOSpeech: Direct Metric Optimization Via Distilled Diffusion Model In Zero-shot Speech Synthesis (2024)
- Daisy-TTS: Simulating Wider Spectrum Of Emotions Via Prosody Embedding Decomposition (2024)