Unsupervised Pre-training For Sequence To Sequence Speech Recognition
2019 · Zhiyun Fan, Shiyu Zhou, Bo Xu
Abstract
This paper proposes a novel approach to pre-training an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and unpaired transcripts. The pre-training method has two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training method integrates rich acoustic and linguistic knowledge into the seq2seq model, which benefits downstream automatic speech recognition (ASR) tasks. The unsupervised pre-training is performed on the AISHELL-2 dataset, and we apply the pre-trained model to multiple paired data ratios of AISHELL-1 and HKUST. We obtain relative character error rate
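The acoustic pre-training objective described in the abstract (predicting masked feature chunks from context) can be illustrated with a minimal sketch. The abstract does not give the chunk length, masking probability, mask value, or loss, so `chunk_len`, `mask_prob`, zero-masking, and the L1 loss below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def mask_chunks(features, chunk_len=4, mask_prob=0.15, seed=0):
    """Zero out contiguous chunks of frames (assumed masking scheme).

    features: list of frames, each frame a list of floats.
    Returns (masked copy of features, sorted indices of masked frames).
    """
    rng = random.Random(seed)
    masked = [frame[:] for frame in features]
    masked_idx = []
    # Walk the utterance in fixed-size chunks and mask each with mask_prob.
    for start in range(0, len(features), chunk_len):
        if rng.random() < mask_prob:
            for t in range(start, min(start + chunk_len, len(features))):
                masked[t] = [0.0] * len(features[t])
                masked_idx.append(t)
    return masked, masked_idx


def masked_l1_loss(predicted, target, masked_idx):
    """Mean absolute error over masked frames only (assumed loss)."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(predicted[t], target[t]):
            total += abs(p - y)
            count += 1
    return total / count


# Toy utterance: 20 frames of 3-dimensional features.
features = [[float(t)] * 3 for t in range(20)]
masked, idx = mask_chunks(features, chunk_len=4, mask_prob=0.5, seed=1)
# An encoder would map `masked` to predictions; here the masked input
# stands in for an untrained predictor to show how the loss is computed.
print(idx, masked_l1_loss(masked, features, idx))
```

Restricting the loss to masked frames means the model gets no credit for copying visible input, which is what forces it to use the surrounding context.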
Related papers
- Pre-training Transformer Decoder For End-to-end ASR Model With Unpaired Speech Data (2022)
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model (2021)
- Wav2Seq: Pre-training Speech-to-text Encoder-decoder Models Using Pseudo Languages (2022)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (2020)
- Non-parallel Sequence-to-sequence Voice Conversion With Disentangled Linguistic And Speaker Representations (2019)
- Almost Unsupervised Text To Speech And Automatic Speech Recognition (2019)
- Joint Training Of Speech Enhancement And Self-supervised Model For Noise-robust ASR (2022)