Perceive And Predict: Self-supervised Speech Representation Based Loss Functions For Speech Enhancement
2023 · George Close, William Ravenscroft, Thomas Hain, et al.
Abstract
Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings. The use of self-supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over the use of an STFT spectrogram distance-based loss, as well as other common loss functions from the speech enhancement literature, is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and shor
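The idea of using a distance between feature encodings as a loss can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `make_toy_encoder` is a hypothetical stand-in (a fixed random linear projection of signal frames), not a real self-supervised model, and mean squared error is assumed as the distance because the abstract does not say which metric is used.

```python
import math
import random


def frame_signal(signal, frame_len, hop):
    """Split a 1-D list of samples into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def make_toy_encoder(frame_len=64, hop=32, dim=8, seed=0):
    """Return a hypothetical stand-in for a pretrained feature encoder.

    It is a fixed random linear projection of each frame, NOT a real
    self-supervised model; it only gives the loss something to compare.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(frame_len)
    weights = [[rng.gauss(0.0, scale) for _ in range(frame_len)]
               for _ in range(dim)]

    def encode(signal):
        return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
                for frame in frame_signal(signal, frame_len, hop)]

    return encode


def encoding_distance(reference, estimate, encode):
    """Mean squared error between the encodings of two equal-length signals."""
    ref_feats, est_feats = encode(reference), encode(estimate)
    sq_errors = [(r - e) ** 2
                 for ref_frame, est_frame in zip(ref_feats, est_feats)
                 for r, e in zip(ref_frame, est_frame)]
    return sum(sq_errors) / len(sq_errors)


# Toy data: a sine wave as "clean" speech plus Gaussian noise at two levels.
rng = random.Random(1)
clean = [math.sin(2 * math.pi * 5 * t / 512) for t in range(512)]
noise = [rng.gauss(0.0, 1.0) for _ in range(512)]
light = [c + 0.1 * n for c, n in zip(clean, noise)]
heavy = [c + 0.5 * n for c, n in zip(clean, noise)]

encode = make_toy_encoder()
d_same = encoding_distance(clean, clean, encode)    # identical signals
d_light = encoding_distance(clean, light, encode)   # mild corruption
d_heavy = encoding_distance(clean, heavy, encode)   # stronger corruption
```

In a training setup, `estimate` would be the enhancement model's output and the distance would be minimised with respect to the model's parameters; with this toy encoder the distance simply grows as the added noise grows.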
Related papers
- A Consolidated View Of Loss Functions For Supervised Deep Learning-based Speech Enhancement (2020)
- Attention-based Speech Enhancement Using Human Quality Perception Modelling (2023)
- An Empirical Study On Speech Restoration Guided By Self Supervised Speech Representation (2023)
- Unsupervised Speech Enhancement With Speech Recognition Embedding And Disentanglement Losses (2021)
- Downstream Task Agnostic Speech Enhancement With Self-supervised Representation Loss (2023)
- Effect Of Noise Suppression Losses On Speech Distortion And ASR Performance (2021)
- Using RLHF To Align Speech Enhancement Approaches To Mean-opinion Quality Scores (2024)
- PL-EESR: Perceptual Loss Based End-To-End Robust Speaker Representation Extraction (2021)