SpeechT5: Unified-modal Encoder-decoder Pre-training For Spoken Language Processing
2021 · Junyi Ao, Rui Wang, Long Zhou, et al.
Abstract
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text state
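The data flow described in the abstract (pre-net, shared encoder-decoder, post-net) can be illustrated with a minimal routing sketch. This is not the authors' implementation: the class name `UnifiedModalSketch`, the helper `make_net`, and the pass-through placeholder networks are hypothetical. The split of the six modal-specific nets into two encoder pre-nets, two decoder pre-nets, and two decoder post-nets is also an assumption, since the abstract gives only the total.

```python
from typing import Callable, Dict, List

Sequence = List[List[float]]  # toy stand-in for a sequence of feature frames
Net = Callable[[Sequence], Sequence]


def make_net(name: str, trace: List[str]) -> Net:
    """Return a placeholder network that records its name and passes data through."""
    def net(x: Sequence) -> Sequence:
        trace.append(name)
        return x
    return net


class UnifiedModalSketch:
    """Toy router: modal-specific pre/post-nets around one shared encoder-decoder."""

    def __init__(self) -> None:
        self.trace: List[str] = []
        modalities = ("speech", "text")
        # Six modal-specific nets. Assumed split (not stated in the abstract):
        # 2 encoder pre-nets, 2 decoder pre-nets, 2 decoder post-nets.
        self.encoder_prenet: Dict[str, Net] = {
            m: make_net(f"{m}-encoder-prenet", self.trace) for m in modalities
        }
        self.decoder_prenet: Dict[str, Net] = {
            m: make_net(f"{m}-decoder-prenet", self.trace) for m in modalities
        }
        self.decoder_postnet: Dict[str, Net] = {
            m: make_net(f"{m}-decoder-postnet", self.trace) for m in modalities
        }
        # One encoder and one decoder are shared by both modalities.
        self.shared_encoder: Net = make_net("shared-encoder", self.trace)
        self.shared_decoder: Net = make_net("shared-decoder", self.trace)

    def forward(
        self,
        source: Sequence,
        target_prefix: Sequence,
        source_modality: str,
        target_modality: str,
    ) -> Sequence:
        self.trace.clear()
        # Source side: modality-specific pre-net, then the shared encoder.
        memory = self.shared_encoder(self.encoder_prenet[source_modality](source))
        # Target side: modality-specific pre-net, then the shared decoder.
        # A real decoder would attend to `memory`; this placeholder ignores it.
        _ = memory
        hidden = self.shared_decoder(self.decoder_prenet[target_modality](target_prefix))
        # The post-net maps decoder output back into the target modality.
        return self.decoder_postnet[target_modality](hidden)
```

Calling `UnifiedModalSketch().forward(frames, prefix, "speech", "text")` routes speech input to text output; swapping the two modality arguments reuses the same shared encoder and decoder with the other pre/post-nets.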
Related papers
- SpeechUT: Bridging Speech And Text With Hidden-unit For Encoder-decoder Based Speech-text Pre-training (2022) · 10.74
- SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data (2022) · 0.00
- SLAM: A Unified Encoder For Speech And Language Modeling Via Speech-text Joint Pre-training (2021) · 0.00
- TencentPretrain: A Scalable And Flexible Toolkit For Pre-training Models Of Different Modalities (2022) · 7.50
- MMSpeech: Multi-modal Multi-task Encoder-decoder Pre-training For Speech Recognition (2022) · 6.34
- Token2vec: A Joint Self-supervised Pre-training Framework Using Unpaired Speech And Text (2022) · 7.16
- Text-guided HuBERT: Self-supervised Speech Pre-training Via Generative Adversarial Networks (2024) · 4.52
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) · 0.00