Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features
2022 · Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, et al.
Abstract
Current state-of-the-art methods for automatic synthetic speech evaluation are based on neural MOS prediction models. These include MOSNet and LDNet, which use spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that takes the speech signal directly as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are benefi
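As a rough illustration of what "phoneme-level F0 and duration features" could look like in practice, the sketch below pools a frame-level F0 track over phoneme segments. It is not the authors' extraction pipeline: the function name `phoneme_prosody`, the 10 ms hop, the use of 0 to mark unvoiced frames, and the (label, start, end) alignment format are all assumptions made for this example.

```python
from typing import List, Tuple

def phoneme_prosody(
    f0: List[float],
    segments: List[Tuple[str, float, float]],
    hop_s: float = 0.01,
) -> List[Tuple[str, float, float]]:
    """Return (label, duration_s, mean_voiced_f0) for each phoneme segment.

    f0       : frame-level F0 in Hz, with 0 marking unvoiced frames (assumed)
    segments : (label, start_s, end_s) alignments, e.g. from a forced aligner
    hop_s    : hop between F0 frames in seconds (assumed 10 ms)
    """
    feats = []
    for label, start, end in segments:
        # Map segment boundaries in seconds to frame indices.
        lo = int(round(start / hop_s))
        hi = int(round(end / hop_s))
        # Average F0 over voiced frames only; fall back to 0.0 if none.
        voiced = [v for v in f0[lo:hi] if v > 0]
        mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0
        feats.append((label, end - start, mean_f0))
    return feats

# Hypothetical toy input: six frames and three aligned phonemes.
f0_track = [0, 0, 100, 110, 120, 0]
alignment = [("s", 0.00, 0.02), ("a", 0.02, 0.05), ("t", 0.05, 0.06)]
for label, dur, mean_f0 in phoneme_prosody(f0_track, alignment):
    print(label, round(dur, 3), mean_f0)
```

A sequence of such per-phoneme values could then be fed to a MOS predictor alongside its usual acoustic input. How the features are normalized and fused with the model is a design choice that this excerpt of the abstract does not specify.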
Related papers
- SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations And Acoustic Features (2024)
- SOMOS: The Samsung Open MOS Dataset For The Evaluation Of Neural Text-to-speech Synthesis (2022)
- Resource-efficient Fine-tuning Strategies For Automatic MOS Prediction In Text-to-speech For Low-resource Languages (2023)
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022)
- Learning To Maximize Speech Quality Directly Using MOS Prediction For Neural Text-to-speech (2020)
- Neural MOS Prediction For Synthesized Speech Using Multi-task Learning With Spoofing Detection And Spoofing Type Classification (2020)
- LDNet: Unified Listener Dependent Modeling In MOS Prediction For Synthetic Speech (2021)
- LE-SSL-MOS: Self-supervised Learning MOS Prediction With Listener Enhancement (2023)