Uncertainty As A Predictor: Leveraging Self-supervised Learning For Zero-shot MOS Prediction
2023 · Aditya Ravuri, Erica Cooper, Junichi Yamagishi
Abstract
Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially when traditional measures such as Mean Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses the gap in efficient audio quality prediction, particularly in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning (SSL) models, such as wav2vec, correlate with MOS. These findings are based on data from the 2022 and 2023 VoiceMOS challenges. We explore the extent of this correlation across different models and language contexts, revealing how the inherent uncertainties in SSL models can serve as effective proxies for audio quality assessment. In particular, we show that the contrastive wav2vec models are the most performant in all settings.
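The idea of treating model uncertainty as a zero-shot quality score can be illustrated with a small sketch. The abstract does not say which uncertainty measure is used, so the mean per-frame softmax entropy below is an assumption chosen for illustration, and the logits and MOS values are made-up toy numbers rather than real wav2vec outputs or VoiceMOS data. The helper names (`mean_frame_entropy`, `spearman`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_frame_entropy(frame_logits):
    """Average Shannon entropy (in nats) over frames; higher means more uncertain.

    Assumption: entropy stands in for the paper's uncertainty measure.
    """
    entropies = []
    for logits in frame_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy per-utterance frame logits (hypothetical, not real model outputs).
utterance_logits = [
    [[8.0, 0.0, 0.0, 0.0], [7.0, 0.0, 0.0, 0.0]],  # confident predictions
    [[2.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0, 0.0]],  # moderately confident
    [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0]],  # near-uniform predictions
]
mos = [4.5, 3.0, 1.5]  # toy listener ratings, one per utterance

uncertainty = [mean_frame_entropy(u) for u in utterance_logits]
rho = spearman(uncertainty, mos)
print(f"Spearman correlation between uncertainty and MOS: {rho:.3f}")
```

In this toy setup the most uncertain utterance has the lowest rating, so the rank correlation is negative. In practice the logits would come from a pretrained SSL model, and the sign and strength of the correlation would have to be measured on real listening-test data.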
Related papers
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022)
- LE-SSL-MOS: Self-supervised Learning MOS Prediction With Listener Enhancement (2023)
- The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction For Multiple Domains (2023)
- SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations And Acoustic Features (2024)
- LDNet: Unified Listener Dependent Modeling In MOS Prediction For Synthetic Speech (2021)
- DDOS: A MOS Prediction Framework Utilizing Domain Adaptive Pre-training And Distribution Of Opinion Scores (2022)
- MOSNet: Deep Learning Based Objective Assessment For Voice Conversion (2019)
- Resource-efficient Fine-tuning Strategies For Automatic MOS Prediction In Text-to-speech For Low-resource Languages (2023)