Zero-shot End-to-end Spoken Language Understanding Via Cross-modal Selective Self-training

Abstract

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore \textit\{zero-shot\} E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of speech-text and text-semantics to match, which often mismatch due to separate collections. Furthermore, using the entire collected speech-text corpus from any domains leads to \textit\{imbalance\} and \textit\{noise\} issues. To address these, we propose \textit\{cross-modal selective self-training\} (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also

Zero-shot End-to-end Spoken Language Understanding Via Cross-modal Selective Self-training

Abstract

Authors

Tags

Stats

Related papers