TERA: Self-supervised Learning Of Transformer Encoder Representation For Speech
2020 · Andy T. Liu, Shang-Wen Li, Hung-Yi Lee
Abstract
We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn by using a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from their altered counterparts, where we use a stochastic policy to alter along various dimensions: time, frequency, and magnitude. TERA can be used for speech representation extraction or fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison of various self-supervised models. TERA achieves strong performance in the comparison by improving upo
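The alteration-and-reconstruction idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `alter_frames` and `l1_loss`, every default value (`time_ratio`, `max_freq_width`, `noise_std`), the choice to zero out masked regions, and the use of an L1 objective are all assumptions made for this example.

```python
import random


def alter_frames(frames, time_ratio=0.15, max_freq_width=2, noise_std=0.2, seed=0):
    """Return (altered, masked_idx) for a list of frames (each a list of floats).

    Hypothetical helper: all defaults are illustrative assumptions,
    not settings taken from the paper.
    """
    rng = random.Random(seed)
    n_frames, n_bins = len(frames), len(frames[0])
    altered = [list(f) for f in frames]  # copy so the clean target is untouched

    # Time alteration: zero out a random subset of frames.
    n_masked = max(1, int(n_frames * time_ratio))
    masked_idx = set(rng.sample(range(n_frames), n_masked))
    for t in masked_idx:
        altered[t] = [0.0] * n_bins

    # Frequency alteration: zero out one contiguous block of bins in every frame.
    width = rng.randint(0, min(max_freq_width, n_bins))
    start = rng.randint(0, n_bins - width)
    for frame in altered:
        for f in range(start, start + width):
            frame[f] = 0.0

    # Magnitude alteration: add Gaussian noise to every value.
    for frame in altered:
        for f in range(n_bins):
            frame[f] += rng.gauss(0.0, noise_std)

    return altered, masked_idx


def l1_loss(pred, target):
    """Mean absolute error between two equally shaped lists of frames."""
    total = sum(abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
    return total / (len(target) * len(target[0]))


# Toy input: 20 frames with 8 feature bins each.
clean = [[float(t + f) for f in range(8)] for t in range(20)]
altered, masked = alter_frames(clean)

# A real encoder would map `altered` back toward `clean`; here the altered
# input itself stands in for the model output to show how the loss is formed.
print(l1_loss(altered, clean))
```

In pre-training, the altered frames would be the encoder input and the clean frames the reconstruction target, so the loss measures how well the model undoes the alteration.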
Related papers
- Self-supervised Rewiring Of Pre-trained Speech Encoders: Towards Faster Fine-tuning With Less Labels In Speech Processing (2022)
- Self-taught Recognizer: Toward Unsupervised Adaptation For Speech Foundation Models (2024)
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)
- Learning Problem-agnostic Speech Representations From Multiple Self-supervised Tasks (2019)
- Progressive Residual Extraction Based Pre-training For Speech Representation Learning (2024)
- Automatic Data Augmentation Selection And Parametrization In Contrastive Self-supervised Speech Representation Learning (2022)
- Joint Training Of Speech Enhancement And Self-supervised Model For Noise-robust ASR (2022)
- Contentvec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)