SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data
2022 · Ziqiang Zhang, Sanyuan Chen, Long Zhou, et al.
Abstract
How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluatio
Related papers
- SpeechUT: Bridging Speech And Text With Hidden-unit For Encoder-decoder Based Speech-text Pre-training (2022) 10.74
- SLAM: A Unified Encoder For Speech And Language Modeling Via Speech-text Joint Pre-training (2021) 0.00
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00
- SpeechT5: Unified-modal Encoder-decoder Pre-training For Spoken Language Processing (2021) 6.32
- Latent Speech-text Transformer (2025) 3.04
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020) 9.59
- Token2vec: A Joint Self-supervised Pre-training Framework Using Unpaired Speech And Text (2022) 7.16
- SPLAT: Speech-language Joint Pre-training For Spoken Language Understanding (2020) 10.35