SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations And Acoustic Features
2024 · Yu-Fei Shi, Yang Ai, Ye-Xin Lu, et al.
Abstract
Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilized limited aspects of speech information for MOS prediction, resulting in restricted prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of features are then fed into the prediction network, which includes multi-task heads and an aggregation layer,
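The two-stream design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes the semantic and acoustic features are already extracted and time-aligned, replaces the multi-task heads with a single hypothetical linear head, and uses plain mean pooling as a stand-in for the aggregation layer. All function names, weights, and numbers are made up for illustration.

```python
from statistics import fmean


def fuse_frames(semantic, acoustic):
    """Concatenate per-frame semantic and acoustic vectors.

    Assumption: both streams have the same number of frames.
    """
    if len(semantic) != len(acoustic):
        raise ValueError("feature streams must be time-aligned")
    return [s + a for s, a in zip(semantic, acoustic)]


def frame_scores(fused, weights, bias):
    """Hypothetical linear head: one scalar score per fused frame."""
    return [sum(w * x for w, x in zip(weights, frame)) + bias for frame in fused]


def aggregate(scores, lo=1.0, hi=5.0):
    """Mean-pool frame scores into one utterance score, clipped to the 1-5 MOS range."""
    return min(hi, max(lo, fmean(scores)))


# Toy inputs: 3 frames, 2-dim "semantic" and 1-dim "acoustic" features (made-up values).
semantic = [[0.2, 0.4], [0.1, 0.3], [0.3, 0.5]]
acoustic = [[1.0], [2.0], [3.0]]

fused = fuse_frames(semantic, acoustic)
utterance_mos = aggregate(frame_scores(fused, weights=[1.0, 1.0, 0.5], bias=2.0))
print(round(utterance_mos, 2))
```

In a real system the toy lists would be replaced by wav2vec2 hidden states and BiVocoder feature-extractor outputs, and the head weights would be learned rather than fixed.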
Related papers
- Investigating Content-aware Neural Text-to-speech MOS Prediction Using Prosodic And Linguistic Features (2022) 6.34
- LE-SSL-MOS: Self-supervised Learning MOS Prediction With Listener Enhancement (2023) 9.23
- Neural MOS Prediction For Synthesized Speech Using Multi-task Learning With Spoofing Detection And Spoofing Type Classification (2020) 9.59
- DDOS: A MOS Prediction Framework Utilizing Domain Adaptive Pre-training And Distribution Of Opinion Scores (2022) 9.03
- SOMOS: The Samsung Open MOS Dataset For The Evaluation Of Neural Text-to-speech Synthesis (2022) 10.74
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022) 6.34
- MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction (2024) 3.58
- AutoMOS: Learning A Non-intrusive Assessor Of Naturalness-of-speech (2016) 0.00