LE-SSL-MOS: Self-supervised Learning MOS Prediction With Listener Enhancement
2023 · Zili Qi, Xinhui Hu, Wangjin Zhou, et al.
Abstract
Recently, researchers have shown increasing interest in automatically predicting subjective evaluation scores for speech synthesis systems. This prediction is challenging, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we developed an SSL-based predictor called LE-SSL-MOS, which utilizes pre-trained self-supervised learning models and further improves prediction accuracy by using the opinion scores of each utterance in a listener enhancement branch. On the unsupervised side, there are two components. First, we fine-tuned the unit language model (ULM) using highly intelligible domain data to improve the correlation of an unsupervised metric, SpeechLMScore. Second, we utilized ASR confidence as a new metric with the help of ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsuper
Related papers
- LDNet: Unified Listener Dependent Modeling In MOS Prediction For Synthetic Speech (2021)
- SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations And Acoustic Features (2024)
- Uncertainty As A Predictor: Leveraging Self-supervised Learning For Zero-shot MOS Prediction (2023)
- Neural MOS Prediction For Synthesized Speech Using Multi-task Learning With Spoofing Detection And Spoofing Type Classification (2020)
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022)
- DDOS: A MOS Prediction Framework Utilizing Domain Adaptive Pre-training And Distribution Of Opinion Scores (2022)
- On The Use Of Self-supervised Speech Representations In Spontaneous Speech Synthesis (2023)
- RAMP: Retrieval-augmented MOS Prediction Via Confidence-based Dynamic Weighting (2023)