An Exploration Of Self-supervised Pretrained Representations For End-to-end Speech Recognition
2021 · Xuankai Chang, Takashi Maekaku, Pengcheng Guo, et al.
Abstract
Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ, WSJ0-2mix with HuBERT, reach or outperform current state
Related papers
- Investigation Of Ensemble Features Of Self-supervised Pretrained Models For Automatic Speech Recognition (2022)
- Unispeech-sat: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition (2022)
- On The Transferability Of Whisper-based Representations For "in-the-wild" Cross-task Downstream Speech Applications (2023)
- Large-scale Self-supervised Speech Representation Learning For Automatic Speaker Verification (2021)
- Progressive Residual Extraction Based Pre-training For Speech Representation Learning (2024)
- An Exploration Into The Performance Of Unsupervised Cross-task Speech Representations For "in The Wild" Edge Applications (2023)
- Similarity Analysis Of Self-supervised Speech Representations (2020)