SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
2022 · Ziqiang Zhang, Long Zhou, Junyi Ao, et al.
Abstract
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at ht
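The decomposition described in the abstract can be illustrated with a toy sketch. This is not the SpeechUT model: the one-dimensional codebook, the unit-to-character table, the collapsing of repeated units, and all function names below are hypothetical stand-ins chosen so the example runs without any dependencies. It only shows how a discrete unit sequence can act as the interface between a speech-to-unit stage and a unit-to-text stage.

```python
from typing import Dict, List, Sequence

# Hypothetical codebook: each hidden unit is represented by a 1-D centroid.
# (Assumption for illustration; real systems cluster learned speech features.)
CODEBOOK: List[float] = [0.0, 1.0, 2.0]

# Hypothetical lookup standing in for a learned unit-to-text model.
UNIT_TO_CHAR: Dict[int, str] = {0: "a", 1: "b", 2: "c"}


def speech_to_unit(frames: Sequence[float]) -> List[int]:
    """Assign each frame to its nearest centroid, then collapse consecutive repeats."""
    units = [
        min(range(len(CODEBOOK)), key=lambda k: abs(frame - CODEBOOK[k]))
        for frame in frames
    ]
    # Collapsing repeats is an assumption of this sketch, not a claim about the paper.
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def unit_to_text(units: Sequence[int]) -> str:
    """Map a unit sequence to text with the toy lookup table."""
    return "".join(UNIT_TO_CHAR[u] for u in units)


def speech_to_text(frames: Sequence[float]) -> str:
    """Compose the two stages; the unit sequence is the only shared interface."""
    return unit_to_text(speech_to_unit(frames))


if __name__ == "__main__":
    toy_frames = [0.1, 0.2, 0.9, 1.1, 2.2]
    print(speech_to_unit(toy_frames))
    print(speech_to_text(toy_frames))
```

Because the two stages communicate only through unit IDs, each could in principle be trained on its own data (speech for the first, text for the second), which is the property the abstract relies on for pre-training with unpaired speech and text.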
Related papers
- Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation (2023)
- SpeechLM: Enhanced Speech Pre-training with Unpaired Textual Data (2022)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-training (2021)
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing (2021)
- ST-BERT: Cross-Modal Language Model Pre-training for End-to-End Spoken Language Understanding (2020)
- SPLAT: Speech-Language Joint Pre-training for Spoken Language Understanding (2020)
- u-HuBERT: Unified Mixed-Modal Speech Pretraining and Zero-Shot Transfer to Unlabeled Modality (2022)
- Joint Pre-training with Speech and Bilingual Text for Direct Speech to Speech Translation (2022)