Zero-shot End-to-end Spoken Language Understanding Via Cross-modal Selective Self-training
2023 · Jianfeng He, Julian Salazar, Kaisheng Yao, et al.
Abstract
End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot learning by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model trained on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, and they often do not because the corpora are collected separately. Furthermore, using an entire speech-text corpus collected from arbitrary domains leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also
Related papers
- Tie Your Embeddings Down: Cross-modal Latent Spaces For End-to-end Spoken Language Understanding (2020) 9.03
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020) 9.59
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) 9.41
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023) 2.26
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021) 2.26
- Towards Reducing The Need For Speech Training Data To Build Spoken Language Understanding Systems (2022) 8.35
- Recent Advances In End-to-end Spoken Language Understanding (2019) 8.09
- Leveraging Multilingual Self-supervised Pretrained Models For Sequence-to-sequence End-to-end Spoken Language Understanding (2023) 0.00