CCATMos: Convolutional Context-aware Transformer Network For Non-intrusive Speech Quality Assessment
2022 · Yuchen Liu, Li-Chia Yang, Alex Pawlicki, et al.
Abstract
Speech quality assessment is a critical component of many voice-communication applications such as telephony and online conferencing. Traditional intrusive speech quality assessment requires the clean reference of the degraded utterance to provide an accurate quality measurement. This requirement limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the "gold standard" in evaluating speech quality, as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called the Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge. Our experiments show that CCAT provides promising MOS predictions compared
Related papers
- Attentivemos: A Lightweight Attention-only Model For Speech Quality Prediction (2024) 3.58
- More For Less: Non-intrusive Speech Quality Assessment With Limited Annotations (2021) 7.16
- MOSNet: Deep Learning Based Objective Assessment For Voice Conversion (2019) 16.90
- Non-intrusive Speech Quality Assessment Using Neural Networks (2019) 13.74
- MetricNet: Towards Improved Modeling For Non-intrusive Speech Quality Assessment (2021) 0.00
- AutoMOS: Learning A Non-intrusive Assessor Of Naturalness-of-speech (2016) 0.00
- Pre-trained Speech Representations As Feature Extractors For Speech Quality Assessment In Online Conferencing Applications (2022) 5.84
- Comparison Of Speech Representations For Automatic Quality Estimation In Multi-speaker Text-to-speech Synthesis (2020) 0.00