SLAM: A Unified Encoder For Speech And Language Modeling Via Speech-text Joint Pre-training
2021 · Ankur Bapna, Yu-An Chung, Nan Wu, et al.
Abstract
Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained mo
Related papers
- SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data (2022)
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020)
- SpeechUT: Bridging Speech And Text With Hidden-unit For Encoder-decoder Based Speech-text Pre-training (2022)
- SPLAT: Speech-language Joint Pre-training For Spoken Language Understanding (2020)
- Style Attuned Pre-training And Parameter Efficient Fine-tuning For Spoken Language Understanding (2020)
- Unified Video-language Pre-training With Synchronized Audio (2024)
- ComSL: A Composite Speech-language Model For End-to-end Speech-to-text Translation (2023)
- SpeechT5: Unified-modal Encoder-decoder Pre-training For Spoken Language Processing (2021)