Wav2vec-Switch: Contrastive Learning From Original-noisy Speech Pairs For Robust Speech Recognition
2021 · Yiming Wang, Jinyu Li, Heming Wang, et al.
Abstract
The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper, we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech so that each serves as an additional prediction target for the other. This forces the network to make consistent predictions for the original and noisy speech, allowing it to learn contextualized representations that are robust to noise. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2
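The switching idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the contextualized vectors and quantized targets have already been computed for both the original and the noisy utterance, uses every other time step as a distractor (wav2vec 2.0 samples distractors from masked steps), and omits any auxiliary losses. The names `contrastive_loss`, `switch_loss`, `lam`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss averaged over time steps.

    For step t the positive is targets[t]; the targets at all other
    steps act as distractors (a simplification assumed for this sketch).
    """
    total = 0.0
    for t, c in enumerate(context):
        logits = [cosine(c, q) / temperature for q in targets]
        # Numerically stable log-sum-exp over positive and distractors.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[t]
    return total / len(context)


def switch_loss(c_orig, q_orig, c_noisy, q_noisy, lam=1.0, temperature=0.1):
    """Standard contrastive terms plus the switched-target terms.

    lam is a hypothetical weight on the switched terms.
    """
    # Each stream predicts its own quantized targets.
    standard = (contrastive_loss(c_orig, q_orig, temperature)
                + contrastive_loss(c_noisy, q_noisy, temperature))
    # Each stream also predicts the other stream's quantized targets.
    switched = (contrastive_loss(c_orig, q_noisy, temperature)
                + contrastive_loss(c_noisy, q_orig, temperature))
    return standard + lam * switched


if __name__ == "__main__":
    # Toy 2-step, 2-dimensional example with made-up vectors.
    c_orig = [[1.0, 0.0], [0.0, 1.0]]
    q_orig = [[1.0, 0.0], [0.0, 1.0]]
    c_noisy = [[0.9, 0.1], [0.2, 0.8]]
    q_noisy = [[0.8, 0.2], [0.1, 0.9]]
    print(switch_loss(c_orig, q_orig, c_noisy, q_noisy))
```

The switched terms are small only when the context vectors of one stream line up with the quantized targets of the other, which is one way to read the abstract's "consistent predictions for the original and noisy speech".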
Related papers
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022) · 11.19
- CCC-wav2vec 2.0: Clustering Aided Cross Contrastive Self-supervised Learning Of Speech Representations (2022) · 7.81
- Multi-variant Consistency Based Self-supervised Learning For Robust Automatic Speech Recognition (2021) · 0.00
- Robust Data2vec: Noise-robust Speech Representation Learning For ASR By Combining Regression And Improved Contrastive Learning (2022) · 9.76
- A Closer Look At Wav2vec2 Embeddings For On-device Single-channel Speech Enhancement (2024) · 0.00
- Wav2code: Restore Clean Speech Representations Via Codebook Lookup For Noise-robust ASR (2023) · 8.35
- On Scaling Contrastive Representations For Low-resource Speech Recognition (2021) · 3.58
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020) · 0.00