SPLAT: Speech-language Joint Pre-training For Spoken Language Understanding
2020 · Yu-An Chung, Chenguang Zhu, Michael Zeng
Abstract
Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness
Related papers
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021) · 2.26
- Understanding Semantics From Speech Through Pre-training (2019) · 0.00
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) · 9.41
- On Joint Training With Interfaces For Spoken Language Understanding (2021) · 7.16
- SLAM: A Unified Encoder For Speech And Language Modeling Via Speech-text Joint Pre-training (2021) · 0.00
- A Study On The Integration Of Pre-trained SSL, ASR, LM And SLU Models For Spoken Language Understanding (2022) · 8.09
- Style Attuned Pre-training And Parameter Efficient Fine-tuning For Spoken Language Understanding (2020) · 6.77
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020) · 9.59