Tie Your Embeddings Down: Cross-modal Latent Spaces For End-to-end Spoken Language Understanding
2020 · Bhuvan Agrawal, Markus Müller, Martin Radfar, et al.
Abstract
End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture, where a shared latent space is learned between the 'acoustic' and 'text' embeddings. We propose using different multi-modal losses to explicitly guide the acoustic embeddings to be closer to the text embeddings, obtained from a semantically powerful pre-trained BERT model. We train the CMLS model on two publicly available E2E datasets, across different cross-modal losses and show that our proposed triplet loss function achieves the best performance. It achieves a relative improvement of 1.4% and 4% respectively over an E2E model without a cross-modal space and a relativ
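The abstract's idea of pulling acoustic embeddings toward text embeddings with a triplet loss can be illustrated with a minimal sketch. This is a generic margin-based triplet loss, not the paper's exact formulation, which the abstract does not give. The function names, the Euclidean distance, the margin value, and the toy 2-D vectors are all assumptions made for illustration; in the paper the text embeddings come from a pre-trained BERT model.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(acoustic, text_pos, text_neg, margin=1.0):
    """Generic margin-based triplet loss (assumed form, not the paper's).

    acoustic: acoustic embedding of an utterance (the anchor)
    text_pos: text embedding of the same utterance (the positive)
    text_neg: text embedding of a different utterance (the negative)
    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`.
    """
    gap = euclidean(acoustic, text_pos) - euclidean(acoustic, text_neg)
    return max(0.0, gap + margin)


# Toy 2-D embeddings, hypothetical values for illustration only.
acoustic = [0.0, 0.0]

# Anchor is far from its matching text embedding: the loss is positive.
loss_far = triplet_loss(acoustic, [3.0, 4.0], [0.0, 1.0])

# Anchor coincides with its matching text embedding: the loss is zero.
loss_near = triplet_loss(acoustic, [0.0, 0.0], [3.0, 4.0])
```

Minimizing such a loss over many triplets moves each acoustic embedding toward the text embedding of its own utterance and away from those of other utterances, which is one way to obtain the kind of shared latent space the abstract describes.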
Related papers
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020)
- Zero-shot End-to-end Spoken Language Understanding Via Cross-modal Selective Self-training (2023)
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023)
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021)
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021)
- Pretrained Semantic Speech Embeddings For End-to-end Spoken Language Understanding Via Cross-modal Teacher-student Learning (2020)
- Multimodal Audio-textual Architecture For Robust Spoken Language Understanding (2023)
- Leveraging Multilingual Self-supervised Pretrained Models For Sequence-to-sequence End-to-end Spoken Language Understanding (2023)