Wav2Seq: Pre-training Speech-to-text Encoder-decoder Models Using Pseudo Languages
2022 · Felix Wu, Kwangyoun Kim, Shinji Watanabe, et al.
Abstract
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
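The abstract does not say how the pseudo language is induced, so the following is only a minimal sketch of one way to turn audio into pseudo subword sequences. It assumes each utterance has already been reduced to a sequence of frame-level discrete unit IDs (for example by clustering speech features, which is an assumption here rather than something stated above). It collapses repeated units and then learns greedy byte-pair-encoding style merges over them. The names `collapse_repeats`, `merge_pair`, and `learn_bpe` are hypothetical and chosen for illustration.

```python
from collections import Counter


def collapse_repeats(ids):
    """Drop consecutive duplicate unit IDs, e.g. [5, 5, 2, 2, 5] -> [5, 2, 5]."""
    out = []
    for i in ids:
        if not out or out[-1] != i:
            out.append(i)
    return out


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])  # tokens are tuples, so + concatenates
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Greedy BPE over unit sequences: repeatedly merge the most frequent adjacent pair.

    Returns the learned merges and the re-tokenized sequences, which would serve
    as pseudo subword targets for a pseudo speech recognition task.
    """
    seqs = [[(u,) for u in s] for s in sequences]  # each token is a tuple of unit IDs
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        seqs = [merge_pair(s, best) for s in seqs]
    return merges, seqs


# Hypothetical frame-level unit IDs for two utterances (illustrative data only).
frames = [[3, 3, 3, 7, 7, 1, 1, 1, 3, 7], [3, 7, 7, 7, 1, 3, 3, 7]]
units = [collapse_repeats(f) for f in frames]
merges, targets = learn_bpe(units, num_merges=2)
```

In this sketch, `targets` holds the pseudo subword sequence for each utterance. An encoder-decoder model would then be trained to produce those sequences from the corresponding audio, in the same way it would later be trained to produce real transcripts.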
Related papers
- Pre-training Transformer Decoder For End-to-end ASR Model With Unpaired Speech Data (2022)
- Wav2vec-s: Semi-supervised Pre-training For Low-resource ASR (2021)
- Unsupervised Pre-training For Sequence To Sequence Speech Recognition (2019)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Self-training And Pre-training Are Complementary For Speech Recognition (2020)
- Self-training For End-to-end Speech Recognition (2019)
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022)
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)