MMMOS: Multi-domain Multi-axis Audio Quality Assessment
2025 · Yi-Cheng Lin, Jia-Hung Chen, Hung-Yi Lee
Abstract
Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS achieves a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ over the baseline, takes first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
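To make the two reported metrics concrete, the following is a minimal sketch of how mean squared error and Kendall's τ can be computed for an ensemble of score predictors. It is not the authors' implementation: the function names, the toy scores, the use of a plain arithmetic mean to combine models, and the tau-a variant (no tie correction) are all assumptions made for illustration.

```python
from itertools import combinations


def ensemble_mean(model_preds):
    """Average per-item predictions across models (assumed ensembling rule)."""
    return [sum(scores) / len(scores) for scores in zip(*model_preds)]


def mean_squared_error(pred, target):
    """Mean of squared differences between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def kendall_tau_a(pred, target):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (target[i] - target[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical listener scores and two hypothetical model outputs.
target = [1.0, 2.0, 3.0, 4.0]
model_a = [2.0, 1.0, 3.0, 4.0]
model_b = [1.0, 3.0, 2.0, 4.0]

ensemble = ensemble_mean([model_a, model_b])
print(mean_squared_error(model_a, target), mean_squared_error(ensemble, target))
print(kendall_tau_a(model_a, target), kendall_tau_a(ensemble, target))
```

In this toy case each model misorders a different pair of items, so averaging them lowers the squared error and restores the correct ranking. Real gains depend on how correlated the individual models' errors are.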
Related papers
- Non-intrusive Speech Quality Assessment Using Neural Networks (2019)
- AutoMOS: Learning A Non-intrusive Assessor Of Naturalness-of-speech (2016)
- Attention-based Speech Enhancement Using Human Quality Perception Modelling (2023)
- More For Less: Non-intrusive Speech Quality Assessment With Limited Annotations (2021)
- A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality (2022)
- Uncertainty As A Predictor: Leveraging Self-supervised Learning For Zero-shot MOS Prediction (2023)
- MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction (2024)
- LDNet: Unified Listener Dependent Modeling In MOS Prediction For Synthetic Speech (2021)