Pre-trained Speech Representations As Feature Extractors For Speech Quality Assessment In Online Conferencing Applications
2022 · Bastiaan Tamm, Helena Balabin, Rik Vandenberghe, et al.
Abstract
Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNNs). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds that of task-specific DNNs by several orders of magnitude, which poses a challenge for fine-tuning on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coe
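The difference between feature extraction and fine-tuning described in the abstract can be illustrated with a toy sketch. This is not the authors' model and does not load XLS-R: `FROZEN_W` is a hypothetical stand-in for a pre-trained encoder, and the frame size, feature dimension, waveforms, and MOS labels are all made-up values chosen only to keep the example small. Only the linear head is updated during training; the encoder weights are never touched.

```python
import random

random.seed(0)

FRAME = 4      # samples per frame (toy value, an assumption)
FEAT_DIM = 3   # feature dimension (toy value, an assumption)

# Hypothetical stand-in for a pre-trained encoder: fixed weights, never updated.
FROZEN_W = [[random.uniform(-1, 1) for _ in range(FRAME)] for _ in range(FEAT_DIM)]


def extract_features(wave):
    """Frame the waveform, project each frame with the frozen weights,
    then mean-pool over time to get one vector per utterance."""
    frames = [wave[i:i + FRAME] for i in range(0, len(wave) - FRAME + 1, FRAME)]
    per_frame = [[sum(w * x for w, x in zip(row, f)) for row in FROZEN_W]
                 for f in frames]
    return [sum(col) / len(per_frame) for col in zip(*per_frame)]


def train_head(features, targets, lr=0.05, epochs=200):
    """Fit only the small linear head (weights and bias) with plain
    stochastic gradient descent on the squared error."""
    w, b = [0.0] * FEAT_DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Made-up waveforms and MOS labels on a 1-5 scale, for illustration only.
waves = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(8)]
mos_labels = [random.uniform(1, 5) for _ in range(8)]

feats = [extract_features(w) for w in waves]
head_w, head_b = train_head(feats, mos_labels)

n_frozen = FEAT_DIM * FRAME      # encoder parameters, excluded from training
n_trainable = len(head_w) + 1    # head weights plus bias
print(f"frozen: {n_frozen}, trainable: {n_trainable}")
```

In a fine-tuning setting, the encoder weights would also receive gradient updates, so the trainable parameter count would include them. With a real pre-trained encoder the gap between the two counts is far larger than in this toy.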
Related papers
- Distillation And Pruning For Scalable Self-supervised Representation-based Speech Quality Assessment (2025)
- Non-intrusive Speech Quality Assessment Using Neural Networks (2019)
- Comparison Of Speech Representations For Automatic Quality Estimation In Multi-speaker Text-to-speech Synthesis (2020)
- CCATMOS: Convolutional Context-aware Transformer Network For Non-intrusive Speech Quality Assessment (2022)
- Efficient Speech Quality Assessment Using Self-supervised Framewise Embeddings (2022)
- A Comparative Re-assessment Of Feature Extractors For Deep Speaker Embeddings (2020)
- More For Less: Non-intrusive Speech Quality Assessment With Limited Annotations (2021)
- AttentiveMOS: A Lightweight Attention-only Model For Speech Quality Prediction (2024)