AutoMOS: Learning A Non-intrusive Assessor Of Naturalness-of-speech
2016 · Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, et al.
Abstract
Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
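The distinction the abstract draws between utterance-level and averaged scoring can be illustrated with a small sketch. The code below is not the paper's evaluation code: the system labels, human MOS values, and predicted scores are hypothetical toy numbers, and the helper names (`pearson`, `ranks`, `spearman`, `mean_by_group`) are made up for illustration. It computes Pearson and Spearman correlations once per utterance and once after averaging each synthesizer's utterances.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_by_group(groups, scores):
    """Average the scores that share a group label (here, a synthesizer)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for g, s in zip(groups, scores):
        sums[g] += s
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Hypothetical toy data: three synthesizers, three utterances each.
systems = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
human = [2.0, 3.0, 2.5, 3.5, 3.0, 4.0, 4.5, 4.0, 5.0]      # human MOS
predicted = [2.8, 2.4, 2.6, 3.9, 3.2, 3.1, 4.2, 4.9, 4.4]  # model output

# Utterance level: one (human, predicted) pair per utterance.
utt_pearson = pearson(human, predicted)
utt_spearman = spearman(human, predicted)

# Averaged level: one pair per synthesizer after averaging its utterances.
human_sys = mean_by_group(systems, human)
pred_sys = mean_by_group(systems, predicted)
names = sorted(human_sys)
sys_pearson = pearson([human_sys[n] for n in names], [pred_sys[n] for n in names])
sys_spearman = spearman([human_sys[n] for n in names], [pred_sys[n] for n in names])

print(f"utterance-level: Pearson={utt_pearson:.3f} Spearman={utt_spearman:.3f}")
print(f"averaged:        Pearson={sys_pearson:.3f} Spearman={sys_spearman:.3f}")
```

In this toy data, averaging cancels much of the per-utterance disagreement, so the averaged correlations come out higher than the utterance-level ones. That mirrors the qualitative pattern the abstract describes, but says nothing about the magnitudes reported in the paper.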
Related papers
- Comparison Of Speech Representations For Automatic Quality Estimation In Multi-speaker Text-to-speech Synthesis (2020)
- SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations And Acoustic Features (2024)
- SOMOS: The Samsung Open MOS Dataset For The Evaluation Of Neural Text-to-speech Synthesis (2022)
- Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features (2022)
- Neural MOS Prediction For Synthesized Speech Using Multi-task Learning With Spoofing Detection And Spoofing Type Classification (2020)
- MOSNet: Deep Learning Based Objective Assessment For Voice Conversion (2019)
- LDNet: Unified Listener Dependent Modeling In MOS Prediction For Synthetic Speech (2021)
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022)